Posted to dev@subversion.apache.org by "Eric S. Raymond" <es...@thyrsus.com> on 2011/12/13 01:26:52 UTC

Problems with the documentation of Subversion dump format

I have just finished writing a full parser for Subversion dumpfiles.
The next release of reposurgeon will have the ability to read them
directly, though not to write them.

In the process, I've looked very closely at the file 

   https://svn.apache.org/repos/asf/subversion/trunk/notes/dump-load-format.txt

and discovered a number of problems with it.  I have commit
privileges on the Subversion repo; I was given them in connection
with svncutter.  I'm willing to fix up that file, but want to check
that I wouldn't be stepping on any toes by doing so.  

My notes on the format follow for review by whoever is the responsible
maintainer.  Please look in particular at the sections bracketed with 
[? and ?].

# The Subversion dumpfile format is documented at
#
# https://svn.apache.org/repos/asf/subversion/trunk/notes/dump-load-format.txt
#
# but there are a number of points on which that document is incomplete or
# vague. The following notes fill in some gaps and document the assumptions
# on which reposurgeon's code is based.
#
# Below, [? ?] flags assertions I rely on but am relatively unsure
# of. These need to be checked further.
#
# First, syntax.  It is implied, but not expressed, that Revision and Node
# records are in an RFC822-like format - headers followed by a spacer line
# followed by a body.  A Revision record begins with a Revision-number
# line; a Node record begins with a Node-path line.
#
# Each header section normally ends with a Content-length line giving the
# length of the record body in bytes, *excluding the spacer line*.
# But some records can never have a body and thus have no content
# length. A node describing a copy operation ends with a Node-copyfrom-path
# line and has no content. A node describing a delete action ends with the
# Node-action line and has no content. Each of these records must still be
# followed by a spacer line.  [?These are the only records that can end
# without a Content-length line.?]
#
# The body of a Revision record consists entirely of a property
# section.  The body of a Node record consists of an optional property
# section followed by an optional text section (one of the two will
# always be present, otherwise the node would be a no-op). When a
# properties section is present, its portion of the record length is
# given by a Prop-content-length header.  When a text section is
# present, its portion of the record length is given by a
# Text-content-length header.  A property-section is always terminated
# with PROPS-END\n; the length of that terminator is *included* in the
# Prop-content-length.
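#
# As an illustration of the record layout described so far, here is a
# minimal Python sketch. It is not reposurgeon's code; read_record and
# the sample data are hypothetical, and it assumes the header spellings
# Prop-content-length and Text-content-length as they appear in dumps.

```python
import io

def read_record(stream):
    """Read one Revision or Node record from a binary stream.

    Returns (headers, props, text); props or text is None when the
    corresponding section is absent.
    """
    line = stream.readline()
    while line == b"\n":              # skip spacer lines between records
        line = stream.readline()
    headers = {}
    while line not in (b"\n", b""):   # headers run up to the spacer line
        name, _, value = line.rstrip(b"\n").partition(b": ")
        headers[name.decode("utf-8")] = value.decode("utf-8")
        line = stream.readline()
    props = text = None
    if "Prop-content-length" in headers:
        props = stream.read(int(headers["Prop-content-length"]))
    if "Text-content-length" in headers:
        text = stream.read(int(headers["Text-content-length"]))
    return headers, props, text

# Illustrative data: a file add with both sections, then a bodiless delete.
sample = (
    b"Node-path: trunk/hello.txt\n"
    b"Node-kind: file\n"
    b"Node-action: add\n"
    b"Prop-content-length: 10\n"
    b"Text-content-length: 6\n"
    b"Content-length: 16\n"
    b"\n"
    b"PROPS-END\n"
    b"hello\n"
    b"\n\n"
    b"Node-path: trunk/old.txt\n"
    b"Node-action: delete\n"
    b"\n\n"
)
stream = io.BytesIO(sample)
h1, p1, t1 = read_record(stream)
h2, p2, t2 = read_record(stream)
```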
#
# A properties section consists of a sequence of paired K and V (key and
# value) records.  The header of each record is a body length.  The body
# begins on the next line and is an uninterpreted byte stream of the
# specified length.  A spacer \n is always inserted after the body
# so the next K or V record (or the terminator) will begin at the start
# of a text line. The last line is always PROPS-END\n.
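#
# The K/V layout above can be sketched as follows. This is only an
# illustration under the assumptions stated in these notes (paired K and
# V records, each followed by a spacer newline, ending with PROPS-END);
# parse_props and the sample bytes are hypothetical.

```python
def parse_props(section: bytes) -> dict:
    """Parse a property section into a {key: value} dict of bytes."""
    props = {}
    key = None
    pos = 0
    while True:
        eol = section.index(b"\n", pos)
        header = section[pos:eol]
        if header == b"PROPS-END":
            return props
        tag, _, length = header.partition(b" ")
        size = int(length)
        # The body is an uninterpreted byte run; it may contain newlines.
        body = section[eol + 1 : eol + 1 + size]
        pos = eol + 1 + size + 1      # skip the body and its spacer \n
        if tag == b"K":
            key = body
        elif tag == b"V":
            props[key] = body
        else:
            raise ValueError(f"unexpected property record {tag!r}")

# Illustrative section; the log message contains an embedded newline.
section = (
    b"K 10\nsvn:author\nV 3\nesr\n"
    b"K 7\nsvn:log\nV 12\nfirst\ncommit\n"
    b"PROPS-END\n"
)
parsed = parse_props(section)
```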
#
# The Properties section of a Revision record consists of some subset
# of the three reserved per-commit properties: svn:author, svn:date,
# and svn:log.  Because a Revision record has no text section, it follows
# that the lengths given in Prop-content-length and Content-length are
# always the same.
#
# Then, semantics.  The three areas where the existing documentation
# is somewhat vague are (a) the persistence of properties, and in
# particular how to delete them, (b) the meaning of the actions ("change",
# "add", "delete", "replace"), and (c) the interpretation of the
# Node-copyfrom-path/Node-copyfrom-rev headers.
#
# The key thing to know about properties is that the format re-lists
# the entire property set (after modification) for a directory or file
# in every node record that changes either property or text for that
# file.
#
# This implies that to delete a given property from a path, a dumpfile
# generator will issue a node with all other properties listed in it;
# to delete all properties from a path, the dumpfile generator will
# simply issue a node with an empty properties section. Note that this
# is different from an *absent* properties section, which will change
# no properties and will be associated with a change to content!
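#
# The difference between an absent and an empty properties section can
# be sketched like this. The function name is hypothetical, and None
# stands in for "no properties section in the node".

```python
def apply_node_props(current, node_props):
    """Return the property set of a path after a node is applied.

    node_props is None when the node has no properties section, and a
    (possibly empty) dict when a section is present.
    """
    if node_props is None:
        return dict(current)      # absent section: properties unchanged
    return dict(node_props)       # present section: full replacement

before = {"svn:eol-style": "native", "svn:mime-type": "text/plain"}
unchanged = apply_node_props(before, None)
cleared = apply_node_props(before, {})
one_deleted = apply_node_props(before, {"svn:mime-type": "text/plain"})
```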
#
# Text sections work the same way.  When present, a text section on a
# file node changes the contents of the file; an absent text section
# means only the file properties change.
#
# The "add" action is used to add new directories and file content.
# Directory adds never have text content; file adds always do.  Either type
# may have properties [?but the Subversion client tools never generate
# an add node with properties?].
#
# The "change" action changes text or properties or both.  It may also
# be used on a directory copy, meaning that the contents of the copy
# should add to and not replace the contents of the target directory.
#
# The "delete" action removes the path and never has properties, as
# they would vanish along with the path.
#
# The "replace" action [?is only issued with directory copies, and?]
# signifies that the existing contents of the directory should be
# removed before the copy.
#
# Interpreting copyfrom_path for file copies is straightforward; the
# target pathname gets the contents of the source pathname.
#
# Directory copies (the primitive beneath branching and tagging) are
# tricky.  For each source path under the source directory, a new path
# is generated by removing the head segment of the pathname that is
# the source directory.  That new path under the target directory gets
# the content of the source path.
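#
# A minimal sketch of that path mapping, assuming paths are plain
# "/"-separated strings; copy_targets and the example paths are
# hypothetical and not taken from any real tool.

```python
def copy_targets(source_dir, target_dir, paths):
    """Map each path under source_dir to its new path under target_dir."""
    prefix = source_dir.rstrip("/") + "/"
    base = target_dir.rstrip("/") + "/"
    mapping = {}
    for path in paths:
        if path.startswith(prefix):
            # Strip the source directory, then graft onto the target.
            mapping[path] = base + path[len(prefix):]
    return mapping

branch = copy_targets(
    "trunk",
    "branches/stable",
    ["trunk/README", "trunk/src/main.c", "tags/1.0/README"],
)
```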
#
# A single revision may include multiple copyfrom nodes, even multiple
# copyfroms to the same directory, even mixed directory and file copies
# to the same directory; [?Subversion client tools never generate such
# mixed copies, but?] I have seen the results of cvs2svn doing it. 
#
# Note: The Subversion notes show a Node record always ending with
# a Content-length header.  This is erroneous (node records can end with
# a Node-copyfrom-path or Node-action line) and may represent a bug.

-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

The spirit of resistance to government is so valuable on certain occasions, 
that I wish it always to be kept alive.  It will often be exercised when 
wrong, but better so than not to be exercised at all. I like a little 
rebellion now and then.	-- Thomas Jefferson, letter to Abigail Adams, 1787

Re: Problems with the documentation of Subversion dump format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Eric S. Raymond wrote on Tue, Dec 13, 2011 at 16:24:00 -0500:
> Here's what's going on. [...]

Thanks for the background; interesting.  I've just read it and will read
it again more carefully tomorrow.

Re: Problems with the documentation of Subversion dump format

Posted by Mark Mielke <ma...@mark.mielke.cc>.

It is also worth pointing out that your response, Stefan, is very well
written and reasonable and I'm putting it on my "to think about" pile. 
Thank you.

On 12/31/2011 07:10 AM, Stefan Sperling wrote:
> To fix your svn:author problem, you or someone else in this community
> could try to come up with a useful set of conventions for storing extra
> information in svn:author or another revision property, and what the syntax
> to store such information would look like. Because, as you already pointed
> out, your problem is rooted in a lack of conventions, so this is what we'd
> need to address. If needed, also specify a way of how Subversion could be
> configured by users to optionally enable this new feature so users can reap
> the associated benefits. If someone writes a nice spec we can file an
> enhancement request in the issue tracker asking for someone to implement it.
> But if the spec touches on unrelated aspects (such as merging moves), I'd
> suggest to put those in a separate set of suggestions and dev@ threads.
>
> To fix your merging moves problem, you could join the currently on-going
> efforts to fix it. In particular we are currently working, and are looking
> for help, in these areas:
>    1) Making multi-layer nested local moves work properly (e.g. moves within
>       locally copied subtrees) -- simpler local move situations are already
>       implemented.
>    2) Detecting server-side moves -- some prototype code exists but there
>       are many things left to design, specify, and implement, especially
>       regarding the means we're going to offer users to solve tree conflicts
>       involving moves.
> While it is always welcome, there is no need to go to the effort of
> contributing code. Useful contributions can be made with much less effort.
> For instance, it helps a lot if you think about aspects of this (very large)
> problem space and try to describe how you'd like Subversion to behave in
> use cases which are important to you. Some, but not all, of the current
> behaviour design around moves is made without user input. For many use cases
> no proper design even exists yet. So more input is definitely welcome.
> And *now* is the time to submit your input, before a lot of code has been
> written that implements behaviour which may or may not turn out to be
> ideal for your situation. Thanks.


-- 
Mark Mielke<ma...@mielke.cc>

Re: Problems with the documentation of Subversion dump format

Posted by Mark Mielke <ma...@mark.mielke.cc>.

Hi Stefan:

You are correct. I've let my frustration get to me ... after years and 
years of waiting and trying to contribute. I agree that the route 
forward you describe is the best approach to getting something achieved.

However, I do think it is worth pointing out - very briefly and with 
purpose - that there are some aspects that I have seen on this list 
which clearly show that the developers of Subversion primarily work from 
a simpler requirement set than the developers of other systems, and this 
can be quite frustrating to deal with. Things that are clear to others - 
such as proper merging across branches and specifically across renames, 
so-called "merge tracking", are core functions in other systems but are 
bolted-on afterthoughts in Subversion.

The "svn:author" issue is yet another example of the same. Other systems 
have as a core feature a "structured" author value. Unique id portion 
plus display portion at least. Subversion chose to leave it as a free 
text field which has traditionally only included the unique id portion. 
"Good enough" for small teams. Progressively worse the larger the team.

I agree it's not worth ranting. It isn't productive. Part of what I say 
is a rant. Part of what I say, though, is hope that by explaining the 
problems a bit more, the people who do contribute more can take these 
requirements into account during the requirements analysis phase instead 
of after the fact as a bug report, and get it right the first time.

Personally, I'm undecided. I joined the list years ago because I 
intended to contribute. The philosophical differences have been hard, 
though, and I've had trouble justifying to myself that if I were to 
spend 40 hours on Subversion development today, that it would be a 
worthwhile investment on my part. Honestly, I'm having real trouble 
explaining why I would purposefully stay with Subversion when I have 
options that do not have these problems that could also benefit from 
investment, or might not even require investment as they acknowledged 
the problems in their design and addressed them first and foremost rather
than as an afterthought.

Pretty negative. Sorry. On the positive side - I've also seen a few of 
you really try to attack the hard design problems. Subversion 1.5, 1.6, 
and 1.7 are pretty major steps forwards. They're just not as major as 
was hoped. :-) Uphill battle all the way. That's really difficult to
consider rewarding. :-)

Sorry all for the interruption.

mark

On 12/31/2011 07:10 AM, Stefan Sperling wrote:
> On Fri, Dec 30, 2011 at 08:22:50PM -0500, Mark Mielke wrote:
>> In any case - this is just yet another example of how Subversion
>> really doesn't scale. That it still can't properly merge across
>> branches or renames is much more important...
> Mark, are you trying to make a useful contribution here on the dev@ list?
> The above digression makes you sound more like you were here to complain
> about random and very loosely specified aspects you don't like about
> Subversion's behaviour. This isn't productive and is not going to fix
> any problems.
> See http://www.producingoss.com/en/common-pitfalls.html#productive-threads
>
> To fix your svn:author problem, you or someone else in this community
> could try to come up with a useful set of conventions for storing extra
> information in svn:author or another revision property, and what the syntax
> to store such information would look like. Because, as you already pointed
> out, your problem is rooted in a lack of conventions, so this is what we'd
> need to address. If needed, also specify a way of how Subversion could be
> configured by users to optionally enable this new feature so users can reap
> the associated benefits. If someone writes a nice spec we can file an
> enhancement request in the issue tracker asking for someone to implement it.
> But if the spec touches on unrelated aspects (such as merging moves), I'd
> suggest to put those in a separate set of suggestions and dev@ threads.
>
> To fix your merging moves problem, you could join the currently on-going
> efforts to fix it. In particular we are currently working, and are looking
> for help, in these areas:
>    1) Making multi-layer nested local moves work properly (e.g. moves within
>       locally copied subtrees) -- simpler local move situations are already
>       implemented.
>    2) Detecting server-side moves -- some prototype code exists but there
>       are many things left to design, specify, and implement, especially
>       regarding the means we're going to offer users to solve tree conflicts
>       involving moves.
> While it is always welcome, there is no need to go to the effort of
> contributing code. Useful contributions can be made with much less effort.
> For instance, it helps a lot if you think about aspects of this (very large)
> problem space and try to describe how you'd like Subversion to behave in
> use cases which are important to you. Some, but not all, of the current
> behaviour design around moves is made without user input. For many use cases
> no proper design even exists yet. So more input is definitely welcome.
> And *now* is the time to submit your input, before a lot of code has been
> written that implements behaviour which may or may not turn out to be
> ideal for your situation. Thanks.

-- 
Mark Mielke<ma...@mielke.cc>

Re: Problems with the documentation of Subversion dump format

Posted by Stefan Sperling <st...@elego.de>.

On Fri, Dec 30, 2011 at 08:22:50PM -0500, Mark Mielke wrote:
> In any case - this is just yet another example of how Subversion
> really doesn't scale. That it still can't properly merge across
> branches or renames is much more important...

Mark, are you trying to make a useful contribution here on the dev@ list?
The above digression makes you sound more like you were here to complain
about random and very loosely specified aspects you don't like about
Subversion's behaviour. This isn't productive and is not going to fix
any problems.
See http://www.producingoss.com/en/common-pitfalls.html#productive-threads

To fix your svn:author problem, you or someone else in this community
could try to come up with a useful set of conventions for storing extra
information in svn:author or another revision property, and what the syntax
to store such information would look like. Because, as you already pointed
out, your problem is rooted in a lack of conventions, so this is what we'd
need to address. If needed, also specify a way of how Subversion could be
configured by users to optionally enable this new feature so users can reap
the associated benefits. If someone writes a nice spec we can file an
enhancement request in the issue tracker asking for someone to implement it.
But if the spec touches on unrelated aspects (such as merging moves), I'd
suggest to put those in a separate set of suggestions and dev@ threads.

To fix your merging moves problem, you could join the currently on-going
efforts to fix it. In particular we are currently working, and are looking
for help, in these areas:
  1) Making multi-layer nested local moves work properly (e.g. moves within
     locally copied subtrees) -- simpler local move situations are already
     implemented.
  2) Detecting server-side moves -- some prototype code exists but there
     are many things left to design, specify, and implement, especially
     regarding the means we're going to offer users to solve tree conflicts
     involving moves.
While it is always welcome, there is no need to go to the effort of
contributing code. Useful contributions can be made with much less effort.
For instance, it helps a lot if you think about aspects of this (very large)
problem space and try to describe how you'd like Subversion to behave in
use cases which are important to you. Some, but not all, of the current
behaviour design around moves is made without user input. For many use cases
no proper design even exists yet. So more input is definitely welcome.
And *now* is the time to submit your input, before a lot of code has been
written that implements behaviour which may or may not turn out to be
ideal for your situation. Thanks.

making progress in a meritocracy (was: Re: format of svn:author)

Posted by Stefan Sperling <st...@elego.de>.

On Tue, Jan 03, 2012 at 12:53:39PM -0500, Mark Mielke wrote:
> 1) Require a means to reliably determine the AUTHOR of a changeset.
> Reliable here means machine consumable in a standard format which
> all tools are aware of because the standard is documented.
> 
> 2) Require all native output from the tool (such as "svn log")
> designed to be read by humans to include a convenient and easily
> readable format.
> 
> 3) Provide a standard convention or protocol for 3rd party tools to
> reliably determine either the unique identifier or the humanly
> readable expansion from Subversion. Either provide additional
> information in the commit itself, or provide a mechanism to either
> lookup the information, or a mechanism to lookup how to get the
> information.

That's a good summary of your requirements. It sums up what you've
been explaining so far.

> >You can keep criticising us all you want, it won't change a thing
> >if you don't also explain in detail what needs to be changed.
> >We cannot read your mind to obtain a functional specification.
> 
> I know. That's why I wanted to go away and think before coming back.

Then do that. Nobody's stopping you.
I think this is exactly what you'd need to do next.

> I'm probably wrong for being frustrated, and I'm probably wrong for
> how I am approaching this. I should probably just sit back and watch
> again for a while. I'm sure you haven't appreciated my criticism,

I suspect the process seems hard to you because this community works
as a meritocracy and you aren't used to working in this way.
Maybe I'm wrong and you already understand most of what I'm about to
say but I'll write it down anyway in case it helps.

Other projects work differently. It might seem easier for things
to happen quickly and arguments to be decided when there's one benevolent
dictator sitting on top making all the tough decisions, like in the git
project, or with a company where upper management is responsible for
enforcing the direction the project is going (clearcase, perforce).

The Subversion project doesn't use either of these approaches.

I suspect that you're frustrated with the style of development that
happens in a community driven by consensus, because you want to see
stuff get done fast, and are afraid of investing time and effort which
might turn out to be in vain in case the community as a whole doesn't
accept the results of your work.

On top of that, judging reactions from a group of people you've never
physically met and trying to gauge the general opinion that's forming
during discussion within such a group is very hard. It requires a bit
of luck as well as confidence and communication skills.

But remember that just because somebody is objecting to aspects of your
ideas doesn't mean that your ideas won't eventually be accepted.
The purpose of the process is to filter out the good ideas from the bad
ones, and transform good ideas into great ones. That requires time, a thick
skin, and the ability to seriously question one's own motives and ideas
for the merit they will bring to the entire community comprised of users
and developers.

Stuff gets done when one or more people who want to drive change sit
down and do the work. When they get stuck at any point in the development
process (design, implementation, testing...) they consult this list for help.
This also allows them to keep an eye on the community's reaction to
their work. There is no requirement for drivers of change to already be
known members of the development community -- the community simply grows
when this happens. The only requirement is that, eventually, everyone is
happy with the changes being made. In extreme cases, there may be voting.
But in this project voting only happened twice within more than 10 years,
and one of those instances was about whitespace formatting in the code so,
yes, we tend to have long-winded discussions :)

Whenever I make non-trivial changes to Subversion I make two assumptions
that you don't seem to be happy to make. I assume that it will take me
forever to get it done, and I assume my ideas and my code are initially
mostly wrong. I don't start out assuming I know what's right.
I assume I'm wrong, and then slowly try to work towards being less wrong.
I rely on the community to help me be less wrong and move towards being
right. That's one way of eventually getting things done in a meritocracy.
This approach prevents frustration about anyone but myself. If I fail,
I failed because of my own fault, so I need to improve and try again.
I don't fail because I was right and everyone else didn't agree with me.

Re: format of svn:author

Posted by Mark Mielke <ma...@mark.mielke.cc>.

On 01/03/2012 12:27 PM, Stefan Sperling wrote:
> On Tue, Jan 03, 2012 at 12:11:20PM -0500, Mark Mielke wrote:
>> Other solutions provide these capabilities out of box.
> Could you point out which solutions exist so people can take a look at them?

GIT, ClearCase, and Perforce are the ones I use.

GIT has name, email, signing keys, and others but is not widely used in 
our organization at this time. The original start of this thread was a 
person talking about GIT and mapping attributes from GIT to Subversion 
and how Subversion didn't seem to have the right attributes to do this.

ClearCase, as mentioned, stores both a "current owner" and a "creation 
event". The "current owner" is in native format - the UNIX uid#/gid#, 
and mapped back to a UNIX username/groupname on demand. The current 
owner is more useful for access controls. The "creation event" contains 
a string which includes the username, domainname (NIS), and fullname. We 
have a customization against ClearCase that adds in submission 
authentication to support shared UNIX accounts that will tie a unique 
identifier to the submission.

Perforce has an account management system of its own. Which isn't 
necessarily an end goal on its own, but when tied with a synchronization 
system such that Perforce accounts match upstream accounts, it does work 
fairly well.

But these are just examples of what other people have done. The 
requirements are pretty straightforward:

1) Require a means to reliably determine the AUTHOR of a changeset. 
Reliable here means machine consumable in a standard format which all 
tools are aware of because the standard is documented.

2) Require all native output from the tool (such as "svn log") designed 
to be read by humans to include a convenient and easily readable format.

3) Provide a standard convention or protocol for 3rd party tools to 
reliably determine either the unique identifier or the humanly readable 
expansion from Subversion. Either provide additional information in the 
commit itself, or provide a mechanism to either lookup the information, 
or a mechanism to lookup how to get the information.

GIT uses attributes. ClearCase uses attributes. Perforce uses lookups.

> You can keep criticising us all you want, it won't change a thing
> if you don't also explain in detail what needs to be changed.
> We cannot read your mind to obtain a functional specification.

I know. That's why I wanted to go away and think before coming back.

I am trying to determine where we would like to go next, and although 
Subversion has been a favourite of mine since 2003 or so, every time I 
try to seriously consider it, it seriously disappoints me. Our 
organization will put time and money into the direction we choose - but 
I can't responsibly select a tool which does not meet our requirements 
no matter what my fancy.

I've been waiting a long time for Subversion to come of age. I've 
monitored throughout. From time to time, I've tried to help. It is 
disenchanting when I see new solutions come out of nowhere (i.e. GIT) 
that already meet requirements out of box with authors and contributors 
that already understand the problems and the solutions which seem to be 
extremely difficult to implement in Subversion. Everything - even things
as simple as this problem - seems like an incredible chore in Subversion
land.

I'm probably wrong for being frustrated, and I'm probably wrong for how
I am approaching this. I should probably just sit back and watch again
for a while. I'm sure you haven't appreciated my criticism, and for many 
of you it is probably not deserved. You have your itches to scratch, and 
you are itching them. Why should you care about my itches? You are being 
nice to me to bother to consider my itches at all. :-)

-- 
Mark Mielke<ma...@mielke.cc>

Re: format of svn:author

Posted by Stefan Sperling <st...@elego.de>.

On Tue, Jan 03, 2012 at 12:11:20PM -0500, Mark Mielke wrote:
> Other solutions provide these capabilities out of box.

Could you point out which solutions exist so people can take a look at them?

You can keep criticising us all you want, it won't change a thing
if you don't also explain in detail what needs to be changed.
We cannot read your mind to obtain a functional specification.

Re: format of svn:author

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

kmradke@rockwellcollins.com wrote on Thu, Jan 05, 2012 at 14:03:37 -0600:
> Mark Mielke <ma...@mark.mielke.cc> wrote on 01/05/2012 12:36:10 PM:
> > On 01/05/2012 12:34 PM, Branko Čibej wrote:
> > > On 05.01.2012 18:25, Mark Mielke wrote:
> > >> On 01/05/2012 12:04 PM, Branko Čibej wrote:
> > >>> Ha, but svn:author currently fills that role. So why add another
> > >>> property?
> > >> If svn:author is defined as the primary key and also the
> > >> authentication key, it does seem simpler and more compatible with
> > >> existing tool assumptions and existing documentation.
> > > svn:author is basically "the username". Of course, many installations,
> > > especially those that use client certificates, will put other things
> > > there; an example I've often seen is CN (Email), which usually is not
> > > what you'd really want since neither is unique or persistent.
> > 
> > Yep. Microsoft AD likes to use user's name in the DN (Distinguished 
> > Name), or at least that is how many people seem to configure it. Yuck. 
> > In any case, I would say it's the responsibility of the organization to 
> > decide what their unique identifier is. If they choose a bad one - 
> > that's on them. :-)
> > 
> > For many systems, username is pretty good.
> 
> Coming late to the discussion, but assuming you are using apache,
> one could use an existing (or custom) auth module in apache
> to mangle/rewrite/map the provided user id that subversion
> uses to something that may be more useful.  Subversion will
> then happily store whatever is provided in the author field.
> This would purely be a server side configuration.

You can do that in the pre-commit hook too.

Re: format of svn:author

Posted by Julian Foad <ju...@btopenworld.com>.

Mark Mielke wrote:

> Stefan Fuhrmann wrote:
>>  On 04.01.2012 19:42, Julian Foad wrote:
>>>     The extended author fields are delivered through revision 
>>> properties [that] are readable but not writable by clients.
>> 
>>  Maybe I missed something in your post, but I want
>>  to stress that it is very important to be able to change
>>  that information later on.
> 
> The idea that Julian put forward is that these would be calculated 
> fields. Never stored. [...]

That's right.  Thanks for clarifying, Mark.

By the way, I'm happy to continue giving feedback, reviews and suggestions on this feature.  I can't at this stage promise to do more than that.

- Julian

Re: format of svn:author

Posted by Mark Mielke <ma...@mark.mielke.cc>.

On 01/08/2012 09:55 PM, Stefan Fuhrmann wrote:
> On 04.01.2012 19:42, Julian Foad wrote:
>>    The extended author fields are delivered through revision 
>> properties.  The values are UTF-8 text.  These revision properties 
>> are readable but not writable by clients.
>
> Maybe I missed something in your post, but I want
> to stress that it is very important to be able to change
> that information later on.

The idea that Julian put forward is that these would be calculated 
fields. Never stored. Always "fresh" (beyond allowance for server side 
caching configured according to requirements).

I think this is perfectly adequate for all use cases I can think of. 
However, it does mean that historical users need to be maintained within 
the server-side database that is used to calculate the field values.

Any fields which are truly per-commit and not per-user, are not 
svn:author fields. They are something else. Per-user fields can stay as 
"latest" or if there is ever a scenario where a user must have an "old" 
identity and a "new" identity, this could be accomplished by having the 
user change their unique identifier allowing both old- and new- mappings 
to remain valid.

-- 
Mark Mielke<ma...@mielke.cc>

Re: format of svn:author

Posted by Stefan Fuhrmann <eq...@web.de>.

On 04.01.2012 19:42, Julian Foad wrote:
>    The extended author fields are delivered through revision properties.  The values are UTF-8 text.  These revision properties are readable but not writable by clients.

Maybe I missed something in your post, but I want
to stress that it is very important to be able to change
that information later on.

One use case is a repository move; another is
changes to the user accounts themselves (had
that more than once in the past). Because typical
pre-revprop-change scripts will compare the
current user with the rev's creator before it accepts
log message changes, an update of old user info
seems necessary.

While we are at it, a server-side tool for efficient batch
user changes would be nice (millions of revprop
changes distributed over multiple repositories).

>    Three property names are initially designated as "well known":
>
>      * prop name: "svn:author:authn-id"
>        purpose: authenticated user id
>        format: as used by Subversion's authentication (the default
>          value of svn:author)
>
>      * prop name: "svn:author:display-name"
>        purpose: display name
>        format: a single line (no line breaks), e.g. person's full
>          name or shortened name or nickname
>
>      * prop name: "svn:author:email"
>        purpose: email address
>        format: [TO BE SPECIFIED HERE]
>
A general observation: It seems impractical to store
anything but the user / account information. The
strategy would be to hope that one of the 3 aspects
of a given account will not change over time.

Modelling a real person with such aspects like having
different accounts at the same time (happens in
sufficiently large companies) seems to be out of scope
entirely. It would open a whole new can of worms.

The deeper problem behind all this is that we record
the history of one's user management and not only
apply the current, consistent account settings.
Introducing ACLs within SVN at some point in the
future will probably make that issue much more obvious.

-- Stefan^2.

Re: format of svn:author

Posted by Branko Čibej <br...@apache.org>.

On 04.01.2012 19:42, Julian Foad wrote:
> DESIGN
>
>   The extended author fields are delivered through revision properties.  The values are UTF-8 text.  These revision properties are readable but not writable by clients.
>
>   Three property names are initially designated  as "well known":
>
>     * prop name: "svn:author:authn-id"
>       purpose: authenticated user id
>       format: as used by Subversion's authentication (the default
>         value of svn:author)
>
>     * prop name: "svn:author:display-name"
>       purpose: display name
>       format: a single line (no line breaks), e.g. person's full
>         name or shortened name or nickname
>
>     * prop name: "svn:author:email"
>       purpose: email address
>       format: [TO BE SPECIFIED HERE]

At the /very/ least you have to define which of the properties must have
values that are unique within the given repository; what is the primary
key; and how to select the property to be shown in log, blame, and the like.

-- Brane

Re: format of svn:author

Posted by km...@rockwellcollins.com.

Mark Mielke <ma...@mark.mielke.cc> wrote on 01/05/2012 12:36:10 PM:
> On 01/05/2012 12:34 PM, Branko Čibej wrote:
> > On 05.01.2012 18:25, Mark Mielke wrote:
> >> On 01/05/2012 12:04 PM, Branko Čibej wrote:
> >>> Ha, but svn:author currently fills that role. So why add another
> >>> property?
> >> If svn:author is defined as the primary key and also the
> >> authentication key, it does seem simpler and more compatible with
> >> existing tool assumptions and existing documentation.
> > svn:author is basically "the username". Of course, many installations,
> > especially those that use client certificates, will put other things
> > there; an example I've often seen is CN (Email), which usually is not
> > what you'd really want since neither is unique or persistent.
> 
> Yep. Microsoft AD likes to use user's name in the DN (Distinguished 
> Name), or at least that is how many people seem to configure it. Yuck. 
> In any case, I would say it's the responsibility of the organization to 
> decide what their unique identifier is. If they choose a bad one - 
> that's on them. :-)
> 
> For many systems, username is pretty good.

Coming late to the discussion, but assuming you are using apache,
one could use an existing (or custom) auth module in apache
to mangle/rewrite/map the provided user id that subversion
uses to something that may be more useful.  Subversion will
then happily store whatever is provided in the author field.
This would purely be a server side configuration.  Some auth
modules already do some manipulation to what the user provides,
such as removing the windows domain info or everything
after @.
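
The kind of manipulation meant here can be sketched in a few lines of Python. This only illustrates the string handling, under the assumption that the id arrives as "DOMAIN\user" or "user@realm"; it is not an Apache module, and the function name is made up.

```python
def normalize_user(raw):
    """Reduce a client-supplied login to a bare user id.

    Drops a leading Windows domain ("DOMAIN\\user") and everything
    from the first "@" onwards ("user@realm").
    """
    user = raw.rsplit("\\", 1)[-1]  # keep the part after the last backslash
    user = user.split("@", 1)[0]    # keep the part before the first "@"
    return user


print(normalize_user("CORP\\jdoe"))        # jdoe
print(normalize_user("jdoe@example.com"))  # jdoe
print(normalize_user("jdoe"))              # jdoe
```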

I'd actually hate to be capturing additional information such
as email address for a specific user since that could change
and is just duplicating what is already available via other
means.  If and when I want/need that info I'd much prefer to
look it up in a directory to get the current value instead of
relying on something attached to an old transaction.

As mentioned, choosing that unique key is important, and
in an enterprise it is essential to ensure all tools
are sharing that same identifier...

Kevin R.

Re: format of svn:author

Posted by Mark Mielke <ma...@mark.mielke.cc>.

On 01/05/2012 12:34 PM, Branko Čibej wrote:
> On 05.01.2012 18:25, Mark Mielke wrote:
>> On 01/05/2012 12:04 PM, Branko Čibej wrote:
>>> Ha, but svn:author currently fills that role. So why add another
>>> property?
>> If svn:author is defined as the primary key and also the
>> authentication key, it does seem simpler and more compatible with
>> existing tool assumptions and existing documentation.
> svn:author is basically "the username". Of course, many installations,
> especially those that use client certificates, will put other things
> there; an example I've often seen is CN (Email), which usually is not
> what you'd really want since neither is unique or persistent.

Yep. Microsoft AD likes to use user's name in the DN (Distinguished 
Name), or at least that is how many people seem to configure it. Yuck. 
In any case, I would say it's the responsibility of the organization to 
decide what their unique identifier is. If they choose a bad one - 
that's on them. :-)

For many systems, username is pretty good.

>> There are of course some expectations around transition - such as we'd
>> only want to do the conversion to the new model once some key tools
>> supported it - "svn log", TortoiseSVN, Subclipse, and Crucible/FishEye
>> will begin working right away as the content of svn:author is now
>> recognizable as Crucible/FishEye user identifiers without the need to
>> define committer mappings and the Subversion metadata could be
>> re-indexed. I think it wouldn't be a problem beyond scheduling.
> Well, given that revision properties aren't indexed at all ... my use of
> the term "primary key" was a bit overdone, since it's really just a
> convention, not a requirement. But if we extend the way we identify
> authors, we'd better do something about enforcing these requirements, too.

Sorry for the confusion. I take it Crucible/FishEye is not widely used 
around here? In any case, FishEye is a tool like ViewVC that scans 
repositories such as Subversion repositories and creates an index to 
allow users to perform lookups from a web view. All commits owned by a 
user. Files that contain a particular text string. Etc. So when I 
mention re-index above, I mean asking Crucible/FishEye to dump its index 
and to re-scan the Subversion repositories. This would allow it to pick 
up the new properties and reset its statistics.

In terms of requirements - I don't think Subversion needs to enforce the 
requirements. It needs only make them known (which is perhaps what you 
are saying). The only true requirement is that the unique identifier can 
be reliably used to lookup additional data. The additional data may or 
may not be unique keys - but this would be up to the upstream data 
source to define. Display name would not generally be unique. Email 
might or might not be unique - there are scenarios for both. For 
example, some users may have a secondary account that they use for 
another purpose, but they might have the same "contact email address". I 
think the requirement is that svn:author be usable as a primary key, and 
that any support for pluggable modules to provide additional data will 
only be given this primary key to determine what additional data to return.

-- 
Mark Mielke<ma...@mielke.cc>

Re: format of svn:author

Posted by Branko Čibej <br...@xbc.nu>.

On 05.01.2012 18:25, Mark Mielke wrote:
> On 01/05/2012 12:04 PM, Branko Čibej wrote:
>> On 05.01.2012 11:32, Julian Foad wrote:
>>> Branko wrote:
>>>> [...] you have to define which of the properties must  have values
>>>> that are unique within the given repository; what is the primary key;
>>> OK, let's say:
>>>
>>> The "svn:author:authn-id" value is the primary key, and so is unique
>>> within a [Subversion repository | Subversion server ?].
>> Ha, but svn:author currently fills that role. So why add another
>> property?
>
> If svn:author is defined as the primary key and also the
> authentication key, it does seem simpler and more compatible with
> existing tool assumptions and existing documentation.

svn:author is basically "the username". Of course, many installations,
especially those that use client certificates, will put other things
there; an example I've often seen is CN (Email), which usually is not
what you'd really want since neither is unique or persistent.

>
>>>    The administrator must configure the Subversion server to perform
>>> a mapping from "svn:author" value to the primary key, typically the
>>> trivial "x ->  x" mapping but another example could extract "1234"
>>> from "John Doe (1234)".
>> That seems less than optimal. Your specification changes the meaning of
>> svn:author. Do you intend this to cater to the installations that are
>> already abusing and overloading svn:author?
>
> As one of these abusers, I don't mind re-writing history to fix this
> problem. I don't have a need for catering here. As per the previous
> email around the original problem of importing content from GIT, I
> don't mind either of:
>
> 1) Prevent users from setting svn:author:* properties, but if they
> happen to exist - to serve them instead of doing a lookup. In this
> case, I would migrate historical data using revprops and make
> svn:author become the primary key / unique identifier again.
>
> 2) Migrate users that do not exist into a database of removed users
> and have the data available for lookup resolution.
>
> Either would work fine.
>
> There are of course some expectations around transition - such as we'd
> only want to do the conversion to the new model once some key tools
> supported it - "svn log", TortoiseSVN, Subclipse, and Crucible/FishEye
> will begin working right away as the content of svn:author is now
> recognizable as Crucible/FishEye user identifiers without the need to
> define committer mappings and the Subversion metadata could be
> re-indexed. I think it wouldn't be a problem beyond scheduling.
>

Well, given that revision properties aren't indexed at all ... my use of
the term "primary key" was a bit overdone, since it's really just a
convention, not a requirement. But if we extend the way we identify
authors, we'd better do something about enforcing these requirements, too.

-- Brane

Re: format of svn:author

Posted by Mark Mielke <ma...@mark.mielke.cc>.

On 01/05/2012 12:04 PM, Branko Čibej wrote:
> On 05.01.2012 11:32, Julian Foad wrote:
>> Branko wrote:
>>> [...] you have to define which of the properties must  have values
>>> that are unique within the given repository; what is the primary key;
>> OK, let's say:
>>
>> The "svn:author:authn-id" value is the primary key, and so is unique within a [Subversion repository | Subversion server ?].
> Ha, but svn:author currently fills that role. So why add another property?

If svn:author is defined as the primary key and also the authentication 
key, it does seem simpler and more compatible with existing tool 
assumptions and existing documentation.

>>    The administrator must configure the Subversion server to perform a mapping from "svn:author" value to the primary key, typically the trivial "x ->  x" mapping but another example could extract "1234" from "John Doe (1234)".
> That seems less than optimal. Your specification changes the meaning of
> svn:author. Do you intend this to cater to the installations that are
> already abusing and overloading svn:author?

As one of these abusers, I don't mind re-writing history to fix this 
problem. I don't have a need for catering here. As per the previous 
email around the original problem of importing content from GIT, I don't 
mind either of:

1) Prevent users from setting svn:author:* properties, but if they 
happen to exist - to serve them instead of doing a lookup. In this case, 
I would migrate historical data using revprops and make svn:author 
become the primary key / unique identifier again.

2) Migrate users that do not exist into a database of removed users and 
have the data available for lookup resolution.

Either would work fine.

There are of course some expectations around transition - such as we'd 
only want to do the conversion to the new model once some key tools 
supported it - "svn log", TortoiseSVN, Subclipse, and Crucible/FishEye 
will begin working right away as the content of svn:author is now 
recognizable as Crucible/FishEye user identifiers without the need to 
define committer mappings and the Subversion metadata could be 
re-indexed. I think it wouldn't be a problem beyond scheduling.

-- 
Mark Mielke<ma...@mielke.cc>

Re: format of svn:author

Posted by Branko Čibej <br...@apache.org>.

On 05.01.2012 11:32, Julian Foad wrote:
> Branko wrote:
>> [...] you have to define which of the properties must  have values
>> that are unique within the given repository; what is the primary key;
> OK, let's say:
>
> The "svn:author:authn-id" value is the primary key, and so is unique within a [Subversion repository | Subversion server ?].

Ha, but svn:author currently fills that role. So why add another property?

>   The administrator must configure the Subversion server to perform a mapping from "svn:author" value to the primary key, typically the trivial "x -> x" mapping but another example could extract "1234" from "John Doe (1234)".

That seems less than optimal. Your specification changes the meaning of
svn:author. Do you intend this to cater to the installations that are
already abusing and overloading svn:author?

-- Brane

Re: format of svn:author

Posted by Julian Foad <ju...@btopenworld.com>.

Branko wrote:
> [...] you have to define which of the properties must  have values
> that are unique within the given repository; what is the primary key;

OK, let's say:

The "svn:author:authn-id" value is the primary key, and so is unique within a [Subversion repository | Subversion server ?].  The administrator must configure the Subversion server to perform a mapping from "svn:author" value to the primary key, typically the trivial "x -> x" mapping but another example could extract "1234" from "John Doe (1234)".
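
A minimal sketch of such a mapping, in Python purely for illustration: the function name and the regular expression are assumptions, and how the administrator would actually configure the mapping is not defined here.

```python
import re

# Matches a trailing parenthesised token, as in "John Doe (1234)".
_TRAILING_ID = re.compile(r"\(([^()]+)\)\s*$")


def primary_key(svn_author):
    """Map an svn:author value to the primary key.

    Falls back to the trivial "x -> x" mapping when the value carries
    no trailing "(id)" part.
    """
    match = _TRAILING_ID.search(svn_author)
    return match.group(1) if match else svn_author


print(primary_key("John Doe (1234)"))  # 1234
print(primary_key("jdoe"))             # jdoe
```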

This specification does not require the values of any other extended author field to be unique.

The administrator may guarantee locally that a particular extended author field is unique in some scope.  For example, a build-bot can update an issue tracker, and so needs to know the issue tracker user id for the author of a particular Subversion revision.  The administrator configures Subversion to provide that id in the "author:tracker-uid" revision property.  The issue tracker user id needs to be unique among all users of the tracker, of course, and so the administrator ensures that is true and then tells the build-bot which of Subversion's extended author fields holds the issue tracker user id: that is, "author:tracker-uid".  Note that its values are unique among all users of that issue tracker, not necessarily the same as being unique across all users of a particular Subversion repository or all Subversion repositories.

> and how to select the property to be shown in log, blame, and the like.

That is briefly stated in the "CLIENT DESIGN" section -- basically, client-side configuration.  (Client-side configuration is of course not ideal, but is a stepping stone to server-dictated configuration which is the subject of a separate and concurrent design effort.)

Mark Mielke wrote:
> On 01/04/2012 01:42 PM, Julian Foad wrote:
>>  A PROPOSAL FOR EXTENDED AUTHOR IDENTIFICATION
>> 
>>  USE CASES
>> 
>>  1.[This one I am aware of.]
>> 
>>     A large company has authenticated user ids that are numeric.  That 
>> means the "log" and "blame" information shown by most Subversion clients 
>> is not easy to understand.  Therefore they use a (post-commit?) hook to 
>> change  the svn:author property to a more friendly string, which (mostly) 
>> solves the display issue.  However, it causes other problems.  [What 
>> problems?]
> 
> Problems:
> 
> 1) The unique identifier is no longer a direct match against external identity 
> management systems. [...]
> 
> 2) Users may end up with multiple unique identifiers over time [...]

So, basically putting display information in svn:author may not cause a problem in that scenario alone but will cause a problem if and when other tools want the value to be a unique id.

>>  2. [This one is a guess.]
>> 
>>     The leader of a small development team sharing a Subversion repository 
>> with other teams wants to set up a build slave that will send an email [...]
> 
> Much of the above can be accomplished today as it is server side [...].
> To extend the above to a situation that makes it more difficult -

Actually I meant UC2 to be a client-side problem like you're describing, so we're both talking about the same thing.

[...]

To everything else you said: yes, sounds good.

- Julian

Re: format of svn:author

Posted by Mark Mielke <ma...@mark.mielke.cc>.

On 01/04/2012 01:42 PM, Julian Foad wrote:
> A PROPOSAL FOR EXTENDED AUTHOR IDENTIFICATION
>
> USE CASES
>
> 1.[This one I am aware of.]
>
>    A large company has authenticated user ids that are numeric.  That means the "log" and "blame" information shown by most Subversion clients is not easy to understand.  Therefore they use a (post-commit?) hook to change
> the svn:author property to a more friendly string, which (mostly) solves the display issue.  However, it causes other problems.  [What problems?]

Problems:

1) The unique identifier is no longer a direct match against external 
identity management systems. For example, if svn:author is "Mark Mielke 
(1234567)" and LDAP stores employeeNumber="1234567" and cn="Mark Mielke", 
very few tools support the ability to pattern match svn:author to pull 
out character groups and to then lookup in an external identity 
management system using the character group. I can't think of a single 
tool that provides this capability out of box. In these tools, if I am 
logged in as "1234567" it cannot know which commits are mine, because 
"1234567" is not equal to "Mark Mielke (1234567)".

2) Users may end up with multiple unique identifiers over time due to 
the unique identifier portion being combined with a more approximate 
(and therefore inaccurate) humanly readable form. Display name or email 
may change over time, and the ability to uniquely identify the author 
becomes more complex as the mapping must include every instance 
discovered at commit time. Some of this is subject to which identifier 
is selected as the unique identifier - but let us say that a system such 
as Forge is used and the identifier is some sort of username such as 
"twoleftfeet". The email might start as "joe@doe.com", but end up as 
"jdoe@acme.com". Any report around commits such as commits made per 
user, or for a particular user - would either end up with split history 
(treating the history as belonging to two or more users) or the reporting 
algorithm would need to allow for each instance to be recognized as the 
same user. Similarly - names can change. Perhaps the person gets married 
or divorced. "Mary Clairmont (prettygirl99)" becomes "Mary Dupont 
(prettygirl99)".
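
To make the split-history effect concrete, here is a small Python sketch of a per-user commit count. The helper name and the "Name (id)" parsing rule are assumptions for illustration; they stand in for whatever convention a given site uses.

```python
import re
from collections import Counter


def user_id(author):
    """Pull the id out of "Display Name (id)"; return the value unchanged otherwise."""
    match = re.search(r"\(([^()]+)\)\s*$", author)
    return match.group(1) if match else author


# svn:author values of three commits by the same person, before and after a name change.
authors = [
    "Mary Clairmont (prettygirl99)",
    "Mary Clairmont (prettygirl99)",
    "Mary Dupont (prettygirl99)",
]

by_raw_value = Counter(authors)               # what a tool sees if it compares strings
by_id = Counter(user_id(a) for a in authors)  # what it sees if it can parse out the id

print(len(by_raw_value))      # 2
print(by_id["prettygirl99"])  # 3
```

A tool that compares the raw strings reports two committers; only a tool that knows the local convention reports one.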

For both of these problems, one could argue that the reporting tool 
could take the complex value into account. It could parse out the unique 
identifier. This presumes that you have access to the source code and 
the ability to make the changes, which may not hold (license restrictions, 
resource requirements, ...). This could be true of one or two tools - but 
certainly not all tools that support Subversion as this is a fairly 
massive list. This is particularly problematic if there is no standard 
as it means that my work in my company against my convention is not 
easily shareable with your work in your company against your convention.

> 2. [This one is a guess.]
>
>    The leader of a small development team sharing a Subversion repository with other teams wants to set up a build slave that will send an email to the users who committed revisions leading to a build failure.  The machine can see the Subversion user id but how can it get the user's email address?  The team leader could ask the repository administrator to add a post-commit hook that adds an email address to a revision property after every commit, but that
>
>      * requires involving the server admin;
>      * won't get updated when the user changes their email address;
>      * won't work for testing old revisions that were already committed before that time;
>      * won't work if the build slave software needs to read a list of all user id->email mappings at once.

Much of the above can be accomplished today as it is server side and 
server side gives more flexibility as it can be customized in one place. 
To extend the above to a situation that makes it more difficult -

There are a number of tools such as Crucible/FishEye that will monitor a 
Subversion repository for changes, and then take action based on the 
commit log. So the actions are being performed by "clients" and not by 
the server itself. If the "client" sees a Subversion commit for 
"1234567" or "jdoe", how does it know who is the authority on what email 
is associated with this account? With svn:author being the unique 
identifier - this is not that difficult in many cases as it is a simple 
LDAP query away. However, if we mix 1) and 2) together, we get the same 
problem. Subversion users need to see full name in "svn log" output, so 
they update svn:author to include the full name like "Mark Mielke 
(1234567)", and then Crucible/FishEye sees the commit as authored by 
"Mark Mielke (1234567)" and how does it look up this value in LDAP to 
find the email?

> 3. [This one is a guess.]
>
>    An administrator wants to integrate Subversion with an issue tracker.  Users have different user ids on the two tools.  The admin wants to configure the tracker so that it automatically annotates an already committed Subversion revision with some status information.  How can the tracker know with what user id to contact the Subversion server?

We don't have this requirement, but I believe this requirement can be 
seen in situations such as:

1) Issue tracker, such as JIRA, is externally visible. Users and 
customers can sign up to the external site directly. Identity management 
system is stored in JIRA as these are essentially "external users".

2) Source management system, such as Subversion, is internal only. Users 
and customers may be able to access the content read-only. Identity 
management system is stored in Microsoft Active Directory or OpenLDAP 
and are assigned according to corporate policies.

In this scenario, there are a lot of requirements to be able to map back 
and forth between the internal and external ID. The binding might be 
stored as an LDAP attribute such as "jirauser".

I don't know if this particular problem is for Subversion to solve or 
not - but if the Subversion solution was general enough to support 
configuration that might allow this information to be exposed in a 
general way, somebody someday would probably be thankful. I wouldn't go 
out of my way to specifically solve this requirement, though. Just, if 
it comes for free with a good solution to the other requirements, don't 
block it. :-)

> The rest of the proposal addresses UC1 and part of UC2 but not UC3.  (UC3 looks like it needs some totally separate solution, outside of Subversion.)

Agree.

> REQUIREMENTS
>
>    A Subversion client (of any kind so designed) shall be able to read extended information about the author of a revision.  This information shall consist of a (possibly empty) set of fields.  The set of possible extended author fields shall include at least:
>
>      * authenticated user id
>
>      * display name
>      * email address
>
>    It shall be possible to add other fields on the server side (by software upgrade and/or by configuration), and for a client (of any kind so designed) to discover and read these fields without any software upgrade on the client side.
>    The svn:author property shall continue to exist.  When not using the extended author fields, the svn:author property must continue to operate as before.  When using the extended author fields, the design may restrict the use of the svn:author field.  Example: the design could require that if extended author fields are to be usable then the svn:author field always holds the authenticated user id and must always be present and non-empty.

This is a smart compromise. Forwards and backwards compatibility. 
Interface restrictions to guarantee extensibility.

In terms of some actual implementation of this, the documentation should 
probably recommend that clients make use of the display name and email 
address as standard fields, and only optionally be aware of 
repository-specific additional attributes. Otherwise it gets pretty 
messy in that you'd have to provide a means to make clients aware of 
what is being published and how and where they should be displayed. I 
would start with just the two and specific recommendations. For example, 
annotated source code on a web page might show the display name, but 
when one mouses over the display name or clicks on a gear icon to the 
side, access to additional details might be displayed. The display name 
might be linked such that a mouse click on the display name pulls up the 
user profile, but the user profile would be identified by the unique 
identifier. Enough information to recommend a consistent and useful 
interface, but not enough to be restrictive.

You cover some of this below:

>    A client shall access the extended author fields through the Subversion server, through the existing client-server protocols, possibly with protocol extensions.  Any protocol extensions shall be backward compatible in that an old server with a new client or an old client with a new server shall (without user intervention) use the old 'svn:author' property.
>
>
>    The fields that are available from a particular server or repository are determined by the administrator.  For any particular committed revision, the server may provide any or all or none of the extended author fields.  A client cannot rely on any particular field being available except to the extent that the administrator gives such an assurance.  Example: if the client requests the authenticated user id and email address for a revision whose author has no email address recorded, the server shall provide the authenticated user id but no email address.  If the server is temporarily unable to look up any information about a user, the server should respond with no extended author fields instead of waiting.
>
>
>    The extended author fields are dynamic in the sense that the server need not always return the same values for the same committed revision.  For example, a client might repeat exactly the same request for information about revision 1234 twice in quick succession, and the server might provide the email address as "a@b.c" the first time and "dd@ee.ff" the second time.  Even the "authenticated user id" field could change.
>
>
> DESIGN
>
>    The extended author fields are delivered through revision properties.  The values are UTF-8 text.  These revision properties are readable but not writable by clients.
>
>    Three property names are initially designated  as "well known":
>
>      * prop name: "svn:author:authn-id"
>        purpose: authenticated user id
>        format: as used by Subversion's authentication (the default
>          value of svn:author)
>
>      * prop name: "svn:author:display-name"
>        purpose: display name
>        format: a single line (no line breaks), e.g. person's full
>          name or shortened name or nickname
>
>      * prop name: "svn:author:email"
>        purpose: email address
>        format: [TO BE SPECIFIED HERE]
>
>
>    Other property names in this name space beginning with "svn:author:" can be designated as "well known" in the future, by an official announcement from the Subversion project.
>
>    An administrator can configure other extended author fields to use property names that are not in the "svn:" name space.  Example: an administrator could configure the property name "author:pgp-sig" to hold the author's PGP signature.

Excellent.

> SERVER DESIGN
>    Any time the server is about to send a set of revision properties to
> the client, the server looks up the extended author fields and adds
> corresponding properties to the set of revision properties that it
> reports to the client.  These property values override any values [...]  The server looks up the extended author fields through some mechanism not defined here, using the value of the "svn:author" property as a key.  The server may cache the results, provided that there is a way for the administrator to make the server use updated information.

The cache can be a typical cache. The information that might be returned 
should generally be semi-persistent and not changing from minute to 
minute. As long as it takes effect within a reasonable time period 
(configurable along with the configuration on how to obtain the extended 
attribute information in the first place?) there is no problem.
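
A typical cache of this kind could look roughly like the following Python sketch. The class name, the lookup callable and the default lifetime are all assumptions; the proposal leaves the look-up mechanism undefined.

```python
import time


class AuthorFieldCache:
    """Cache extended author fields per svn:author value for a limited time."""

    def __init__(self, lookup, ttl_seconds=3600.0, clock=time.monotonic):
        self._lookup = lookup    # callable: svn:author value -> dict of fields
        self._ttl = ttl_seconds  # how long an entry stays valid
        self._clock = clock      # injectable so the expiry logic can be tested
        self._entries = {}       # svn:author value -> (time stored, fields)

    def get(self, svn_author):
        now = self._clock()
        hit = self._entries.get(svn_author)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        fields = self._lookup(svn_author)
        self._entries[svn_author] = (now, fields)
        return fields

    def clear(self):
        """Let the administrator force the use of updated information."""
        self._entries.clear()
```

The configurable lifetime corresponds to the "reasonable time period" above, and clear() is one way to meet the requirement that the administrator can make the server use updated information.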

>    If the client attempts to set any revision property in the "svn:author:" name space, the server shall report an error to the client.  This applies even if the property value matches the value that was last read from the server or is currently known to the server, and even if the
> specific property name is not known to the server.  If the client attempts to set any revision property that is not in the "svn:author:" name space but might be configured as an extended author field, the server records that revision property in the normal way.  If a revision property (of any name) has a stored value and the extended author field look-up also provides a value for the same property name, the latter takes priority.
>
>
>    The extended author fields [are | are not] available to the following hook scripts: pre-commit, ...

Although not necessary for the fields to be available to the hook 
scripts - it would be extremely convenient for them to be so. We have 
hooks that perform LDAP lookups - but each hook has to have intimate 
knowledge of the environment it is contained in making them difficult to 
be published - for example, as an open source component that others 
could re-use. They may have hard coded LDAP bind passwords for example, 
making them insecure to publish. It would be extremely nice if any open 
source component writer could make use of these fields without having to 
care where the values come from, and the configuration for where the 
values come from could be centralized in one place - the Subversion server.

> CLIENT DESIGN
>
>    Just an example.  The "svn log" and "svn blame" commands could request the revision property named "svn:author:display-name", and if that is returned then use it instead of "svn:author", otherwise use the value of "svn:author".  Further, a client-side configuration option could specify which property name should be used for these display purposes, so for example some users in a particular team could choose to have the "author:nickname" revision property displayed instead of "svn:author:display-name".

This would be great. I think many people like to see the format that GIT 
uses: Display Name <em...@domain>. This should be an option.
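
The fallback rule from the proposal, plus the optional name-and-email format, can be sketched like this in Python. The function and its parameters are hypothetical; only the property names come from the proposal.

```python
def display_author(revprops, prop="svn:author:display-name", git_style=False):
    """Choose the author string to show for one revision.

    revprops  -- dict of revision properties as returned by the server
    prop      -- property preferred for display (client-side configuration)
    git_style -- if True and an email is available, append it as "<email>"
    """
    name = revprops.get(prop) or revprops.get("svn:author", "")
    email = revprops.get("svn:author:email")
    if git_style and email:
        return f"{name} <{email}>"
    return name


props = {
    "svn:author": "1234",
    "svn:author:display-name": "John Doe",
    "svn:author:email": "jdoe@example.com",
}
print(display_author(props))                  # John Doe
print(display_author(props, git_style=True))  # John Doe <jdoe@example.com>
print(display_author({"svn:author": "1234"})) # 1234
```

An old server that returns only svn:author still produces sensible output, which matches the backward-compatibility requirement.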

> FURTHER SCOPE
>
>    Does a client need to be able to look up the information in other ways, such as starting from svn:author rather than a revision number, or starting from an extended author field?
>

I'm not clear on how "svn blame" is implemented. Presuming that it knows 
what commit each line belongs to and that these are already being 
queried (i.e. the implementation won't have to significantly change as a 
result of this proposal), it is satisfactory for it to access the 
information from the revision properties. I don't at the moment see a 
requirement to be able to query a list of known users, or information 
for a particular user. Subversion is not a directory service. The main 
capability being provided is to enable Subversion clients to be ignorant 
about how the server has been configured to perform authentication and 
identification of users, but still be able to provide extended 
information about Subversion metadata back to the user. Staying within 
that domain is probably smart, as it draws a clear boundary around the 
scope that is being agreed to.

Final thoughts on this draft:

The reference implementation should come with perhaps two server modules 
to support this capability. One should be a caching LDAP implementation 
that is fully configurable. One should be based on operating system 
services (PAM or getent() for Unix?). Other implementations should be 
possible, but left outside of core.

If the Subversion developers agree to some refinement of this proposal, 
I understand that developer resources are limited and that there is no 
guarantee that it would ever be implemented or, if implemented, that it 
would ever be completed and distributed in core. I'm thinking that this 
sort of project might be a good entry point for somebody such as myself 
to contribute. Not sure about time right now - but if you put in the 
effort to review and refine, then it would be only fair for me to at 
least try to contribute.

Thanks for the time you put into this, Julian.

-- 
Mark Mielke<ma...@mielke.cc>

Re: format of svn:author

Posted by Mark Mielke <ma...@mark.mielke.cc>.

This is great, Julian. It is pretty good for a draft. I'll get back to 
you with the detailed answers tonight - just wanted to give my thumbs up.

I'm ok with a simpler solution that just sets the attributes on commit, 
but what you have described looks like a good step up from the minimum 
and solves additional requirements which fall under "would be nice" for 
me... Thanks!


On 01/04/2012 01:42 PM, Julian Foad wrote:
> Hi Mark.
>
> I think I can see to some extent what you are getting at, but not clearly.  We all need a common frame of reference for understanding why and how some sort of extended author information could be useful.  To help us get there, I put
> together the following tentative proposal to act as a basis for
> discussion.  Perhaps we can now move on to talking about specific requirements and designs.  What parts of it are aligned with your thinking and what
> have I got wrong or missed out?
>
> Please note that this draft is purely an invention of my mind and I do not expect it to be an accurate reflection of your or anyone else's requirements.
>
>
> A PROPOSAL FOR EXTENDED AUTHOR IDENTIFICATION
>
> USE CASES
>
> 1. [This one I am aware of.]
>
>    A large company has authenticated user ids that are numeric.  That means the "log" and "blame" information shown by most Subversion clients is not easy to understand.  Therefore they use a (post-commit?) hook to change
> the svn:author property to a more friendly string, which (mostly) solves the display issue.  However, it causes other problems.  [What problems?]
>
>
> 2. [This one is a guess.]
>
>    The leader of a small development team sharing a Subversion repository with other teams wants to set up a build slave that will send an email to the users who committed revisions leading to a build failure.  The machine can see the Subversion user id but how can it get the user's email address?  The team leader could ask the repository administrator to add a post-commit hook that adds an email address to a revision property after every commit, but that
>
>      * requires involving the server admin;
>      * won't get updated when the user changes their email address;
>      * won't work for testing old revisions that were already committed before that time;
>      * won't work if the build slave software needs to read a list of all user id->email mappings at once.
>
>
> 3. [This one is a guess.]
>
>    An administrator wants to integrate Subversion with an issue tracker.  Users have different user ids on the two tools.  The admin wants to configure the tracker so that it automatically annotates an already committed Subversion revision with some status information.  How can the tracker know with what user id to contact the Subversion server?
>
> The rest of the proposal addresses UC1 and part of UC2 but not UC3.  (UC3 looks like it needs some totally separate solution, outside of Subversion.)
>
>
>
> REQUIREMENTS
>
>    A Subversion client (of any kind so designed) shall be able to read extended information about the author of a revision.  This information shall consist of a (possibly empty) set of fields.  The set of possible extended author fields shall include at least:
>
>      * authenticated user id
>
>      * display name
>      * email address
>
>    It shall be possible to add other fields on the server side (by software upgrade and/or by configuration), and for a client (of any kind so designed) to discover and read these fields without any software upgrade on the client side.
>    The svn:author property shall continue to exist.  When not using the extended author fields, the svn:author property must continue to operate as before.  When using the extended author fields, the design may restrict the use of the svn:author field.  Example: the design could require that if extended author fields are to be usable then the svn:author field always holds the authenticated user id and must always be present and non-empty.
>
>
>    A client shall access the extended author fields through the Subversion server, through the existing client-server protocols, possibly with protocol extensions.  Any protocol extensions shall be backward compatible in that an old server with a new client or an old client with a new server shall (without user intervention) use the old 'svn:author' property.
>
>
>    The fields that are available from a particular server or repository are determined by the administrator.  For any particular committed revision, the server may provide any or all or none of the extended author fields.  A client cannot rely on any particular field being available except to the extent that the administrator gives such an assurance.  Example: if the client requests the authenticated user id and email address for a revision whose author has no email address recorded, the server shall provide the authenticated user id but no email address.  If the server is temporarily unable to look up any information about a user, the server should respond with no extended author fields instead of waiting.
>
>
>    The extended author fields are dynamic in the sense that the server need not always return the same values for the same committed revision.  For example, a client might repeat exactly the same request for information about revision 1234 twice in quick succession, and the server might provide the email address as "a@b.c" the first time and "dd@ee.ff" the second time.  Even the "authenticated user id" field could change.
>
>
> DESIGN
>
>    The extended author fields are delivered through revision properties.  The values are UTF-8 text.  These revision properties are readable but not writable by clients.
>
>    Three property names are initially designated as "well known":
>
>      * prop name: "svn:author:authn-id"
>        purpose: authenticated user id
>        format: as used by Subversion's authentication (the default
>          value of svn:author)
>
>      * prop name: "svn:author:display-name"
>        purpose: display name
>        format: a single line (no line breaks), e.g. person's full
>          name or shortened name or nickname
>
>      * prop name: "svn:author:email"
>        purpose: email address
>        format: [TO BE SPECIFIED HERE]
>
>
>    Other property names in this name space beginning with "svn:author:" can be designated as "well known" in the future, by an official announcement from the Subversion project.
>
>    An administrator can configure other extended author fields to use property names that are not in the "svn:" name space.  Example: an administrator could configure the property name "author:pgp-sig" to hold the author's PGP signature.
>
>
>
> SERVER DESIGN
>    Any time the server is about to send a set of revision properties to
> the client, the server looks up the extended author fields and adds
> corresponding properties to the set of revision properties that it
> reports to the client.  These property values override any values The server looks up the extended author fields through some mechanism not defined here, using the value of the "svn:author" property as a key.  The server may cache the results, provided that there is a way for the administrator to make the server use updated information.
>
>
>    If the client attempts to set any revision property in the "svn:author:" name space, the server shall report an error to the client.  This applies even if the property value matches the value that was last read from the server or is currently known to the server, and even if the
> specific property name is not known to the server.  If the client attempts to set any revision property that is not in the "svn:author:" name space but might be configured as an extended author field, the server records that revision property in the normal way.  If a revision property (of any name) has a stored value and the extended author field look-up also provides a value for the same property name, the latter takes priority.
>
>
>    The extended author fields [are | are not] available to the following hook scripts: pre-commit, ...
>
>
> CLIENT DESIGN
>
>    Just an example.  The "svn log" and "svn blame" commands could request the revision property named "svn:author:display-name", and if that is returned then use it instead of "svn:author", otherwise use the value of "svn:author".  Further, a client-side configuration option could specify which property name should be used for these display purposes, so for example some users in a particular team could choose to have the "author:nickname" revision property displayed instead of "svn:author:display-name".
>
>
>
> FURTHER SCOPE
>
>    Does a client need to be able to look up the information in other ways, such as starting from svn:author rather than a revision number, or starting from an extended author field?
>
>
> - Julian


-- 
Mark Mielke<ma...@mielke.cc>

Re: format of svn:author

Posted by Mark Mielke <ma...@mark.mielke.cc>.

On 01/05/2012 07:44 AM, Johan Corveleyn wrote:
> On Wed, Jan 4, 2012 at 7:42 PM, Julian Foad<ju...@btopenworld.com>  wrote:
>
> [ ... ]
>
>> SERVER DESIGN
>>    Any time the server is about to send a set of revision properties to
>> the client, the server looks up the extended author fields and adds
>> corresponding properties to the set of revision properties that it
>> reports to the client.  These property values override any values The server looks up the extended author fields through some mechanism not defined here, using the value of the "svn:author" property as a key.  The server may cache the results, provided that there is a way for the administrator to make the server use updated information.
> Just wondering: a lookup approach, does that address the original
> problem that started this whole discussion? I.e.: how to avoid the
> information loss when importing from GIT into a Subversion repository?
> Since GIT has those additional attributes (display name and email
> address?) annotated with every commit, a lookup approach is in general
> not sufficient to store this information ...
>
> Not important I think, but I'm just noting the discrepancy ...

I was thinking this as well, but I dismissed it (perhaps prematurely) 
with the thought that GIT, being a DVCS, does not have the capability 
to have a centralized authority in terms of mapping these attributes. 
Subversion is designed for centralization of the metadata, and therefore 
it may be a better fit for the mappings to also be centralized. Somebody 
who is importing GIT to Subversion might choose to do so by selecting an 
appropriate unique identifier for their requirements; they could then 
import the mappings into the centralized record that maps unique 
identifiers to attributes. LDAP or what have you.

Alternatively, the model could normally prevent svn:author:* from being 
set, but if such properties happen to exist, they could be served as 
historical data.

Either way could be made to work. Not sure what is "best".

-- 
Mark Mielke<ma...@mielke.cc>

Re: format of svn:author

Posted by Johan Corveleyn <jc...@gmail.com>.

On Wed, Jan 4, 2012 at 7:42 PM, Julian Foad <ju...@btopenworld.com> wrote:

[ ... ]

> SERVER DESIGN
>   Any time the server is about to send a set of revision properties to
> the client, the server looks up the extended author fields and adds
> corresponding properties to the set of revision properties that it
> reports to the client.  These property values override any values The server looks up the extended author fields through some mechanism not defined here, using the value of the "svn:author" property as a key.  The server may cache the results, provided that there is a way for the administrator to make the server use updated information.

Just wondering: a lookup approach, does that address the original
problem that started this whole discussion? I.e.: how to avoid the
information loss when importing from GIT into a Subversion repository?
Since GIT has those additional attributes (display name and email
address?) annotated with every commit, a lookup approach is in general
not sufficient to store this information ...

Not important I think, but I'm just noting the discrepancy ...

-- 
Johan

Re: format of svn:author

Posted by Julian Foad <ju...@btopenworld.com>.

Hi Mark.

I think I can see to some extent what you are getting at, but not clearly.  We all need a common frame of reference for understanding why and how some sort of extended author information could be useful.  To help us get there, I put 
together the following tentative proposal to act as a basis for 
discussion.  Perhaps we can now move on to talking about specific requirements and designs.  What parts of it are aligned with your thinking and what 
have I got wrong or missed out?

Please note that this draft is purely an invention of my mind and I do not expect it to be an accurate reflection of your or anyone else's requirements.


A PROPOSAL FOR EXTENDED AUTHOR IDENTIFICATION

USE CASES

1. [This one I am aware of.]

  A large company has authenticated user ids that are numeric.  That means the "log" and "blame" information shown by most Subversion clients is not easy to understand.  Therefore they use a (post-commit?) hook to change 
the svn:author property to a more friendly string, which (mostly) solves the display issue.  However, it causes other problems.  [What problems?]


2. [This one is a guess.]

  The leader of a small development team sharing a Subversion repository with other teams wants to set up a build slave that will send an email to the users who committed revisions leading to a build failure.  The machine can see the Subversion user id but how can it get the user's email address?  The team leader could ask the repository administrator to add a post-commit hook that adds an email address to a revision property after every commit, but that

    * requires involving the server admin;
    * won't get updated when the user changes their email address;
    * won't work for testing old revisions that were already committed before that time;
    * won't work if the build slave software needs to read a list of all user id->email mappings at once.


3. [This one is a guess.]

  An administrator wants to integrate Subversion with an issue tracker.  Users have different user ids on the two tools.  The admin wants to configure the tracker so that it automatically annotates an already committed Subversion revision with some status information.  How can the tracker know with what user id to contact the Subversion server?

The rest of the proposal addresses UC1 and part of UC2 but not UC3.  (UC3 looks like it needs some totally separate solution, outside of Subversion.)



REQUIREMENTS

  A Subversion client (of any kind so designed) shall be able to read extended information about the author of a revision.  This information shall consist of a (possibly empty) set of fields.  The set of possible extended author fields shall include at least:

    * authenticated user id

    * display name
    * email address

  It shall be possible to add other fields on the server side (by software upgrade and/or by configuration), and for a client (of any kind so designed) to discover and read these fields without any software upgrade on the client side.
  The svn:author property shall continue to exist.  When not using the extended author fields, the svn:author property must continue to operate as before.  When using the extended author fields, the design may restrict the use of the svn:author field.  Example: the design could require that if extended author fields are to be usable then the svn:author field always holds the authenticated user id and must always be present and non-empty.


  A client shall access the extended author fields through the Subversion server, through the existing client-server protocols, possibly with protocol extensions.  Any protocol extensions shall be backward compatible in that an old server with a new client or an old client with a new server shall (without user intervention) use the old 'svn:author' property.


  The fields that are available from a particular server or repository are determined by the administrator.  For any particular committed revision, the server may provide any or all or none of the extended author fields.  A client cannot rely on any particular field being available except to the extent that the administrator gives such an assurance.  Example: if the client requests the authenticated user id and email address for a revision whose author has no email address recorded, the server shall provide the authenticated user id but no email address.  If the server is temporarily unable to look up any information about a user, the server should respond with no extended author fields instead of waiting.


  The extended author fields are dynamic in the sense that the server need not always return the same values for the same committed revision.  For example, a client might repeat exactly the same request for information about revision 1234 twice in quick succession, and the server might provide the email address as "a@b.c" the first time and "dd@ee.ff" the second time.  Even the "authenticated user id" field could change.


DESIGN

  The extended author fields are delivered through revision properties.  The values are UTF-8 text.  These revision properties are readable but not writable by clients.

  Three property names are initially designated as "well known":

    * prop name: "svn:author:authn-id"
      purpose: authenticated user id
      format: as used by Subversion's authentication (the default
        value of svn:author)

    * prop name: "svn:author:display-name"
      purpose: display name
      format: a single line (no line breaks), e.g. person's full
        name or shortened name or nickname

    * prop name: "svn:author:email"
      purpose: email address
      format: [TO BE SPECIFIED HERE]


  Other property names in this name space beginning with "svn:author:" can be designated as "well known" in the future, by an official announcement from the Subversion project.

  An administrator can configure other extended author fields to use property names that are not in the "svn:" name space.  Example: an administrator could configure the property name "author:pgp-sig" to hold the author's PGP signature.



SERVER DESIGN
  Any time the server is about to send a set of revision properties to 
the client, the server looks up the extended author fields and adds 
corresponding properties to the set of revision properties that it 
reports to the client.  These property values override any values The server looks up the extended author fields through some mechanism not defined here, using the value of the "svn:author" property as a key.  The server may cache the results, provided that there is a way for the administrator to make the server use updated information.
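
As a rough illustration only, the lookup-and-cache behaviour described in this paragraph might be sketched in Python as follows.  The class name, the `lookup` callable and the time-to-live parameter are all hypothetical; the proposal deliberately leaves the real lookup mechanism undefined, and nothing here is an existing Subversion API.

```python
import time


class AuthorFieldCache:
    """Caches extended author fields, keyed by the svn:author value.

    `lookup` is a hypothetical backend (LDAP, PAM, ...) that maps an
    author id to a dict of property names and values.
    """

    def __init__(self, lookup, ttl_seconds=300.0, clock=time.monotonic):
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # author id -> (expiry time, fields)

    def invalidate(self):
        """Lets the administrator make the server use updated information."""
        self._entries.clear()

    def fields_for(self, author):
        now = self._clock()
        cached = self._entries.get(author)
        if cached is not None and cached[0] > now:
            return dict(cached[1])
        try:
            fields = dict(self._lookup(author))
        except Exception:
            # Backend unavailable: report no extended author fields
            # instead of waiting, and do not cache the failure.
            return {}
        self._entries[author] = (now + self._ttl, fields)
        return dict(fields)
```

The explicit `invalidate()` call stands in for "a way for the administrator to make the server use updated information"; a real server would presumably expose that through its own configuration or admin tooling.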


  If the client attempts to set any revision property in the "svn:author:" name space, the server shall report an error to the client.  This applies even if the property value matches the value that was last read from the server or is currently known to the server, and even if the 
specific property name is not known to the server.  If the client attempts to set any revision property that is not in the "svn:author:" name space but might be configured as an extended author field, the server records that revision property in the normal way.  If a revision property (of any name) has a stored value and the extended author field look-up also provides a value for the same property name, the latter takes priority.
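
The two rules in this paragraph (reject client writes in the "svn:author:" name space; let looked-up values take priority over stored ones) could be sketched like this.  The function and exception names are invented for illustration and do not correspond to any existing Subversion server code.

```python
RESERVED_PREFIX = "svn:author:"


class ReservedPropertyError(Exception):
    """Raised when a client tries to set a property in the reserved name space."""


def check_client_set(name):
    """Reject any client attempt to set a revprop under "svn:author:".

    The check is purely on the name, so it applies whatever the value is
    and whether or not the server knows the specific property.
    """
    if name.startswith(RESERVED_PREFIX):
        raise ReservedPropertyError(
            "clients may not set revision property %r" % name)


def revprops_for_client(stored, looked_up):
    """Merge stored revprops with looked-up extended author fields.

    Where both provide a value for the same name, the looked-up value
    takes priority.
    """
    merged = dict(stored)
    merged.update(looked_up)
    return merged
```

Note that plain "svn:author" does not start with "svn:author:" and so is not rejected by this check, which matches the requirement that svn:author continues to operate as before.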


  The extended author fields [are | are not] available to the following hook scripts: pre-commit, ...


CLIENT DESIGN

  Just an example.  The "svn log" and "svn blame" commands could request the revision property named "svn:author:display-name", and if that is returned then use it instead of "svn:author", otherwise use the value of "svn:author".  Further, a client-side configuration option could specify which property name should be used for these display purposes, so for example some users in a particular team could choose to have the "author:nickname" revision property displayed instead of "svn:author:display-name".
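
A minimal sketch of that fallback, assuming the client already holds the revision properties as a plain dict; the function name and the `preferred` parameter (standing in for the client-side configuration option) are hypothetical.

```python
def display_author(revprops, preferred="svn:author:display-name"):
    """Pick the author string to show in "svn log" / "svn blame" style output.

    Uses the preferred extended author property if the server returned
    it, otherwise falls back to the value of svn:author.
    """
    value = revprops.get(preferred)
    if value is not None:
        return value
    return revprops.get("svn:author", "")
```

A team that prefers nicknames would call `display_author(revprops, preferred="author:nickname")`; against an old server that returns only svn:author, the same call still yields the plain user id.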



FURTHER SCOPE

  Does a client need to be able to look up the information in other ways, such as starting from svn:author rather than a revision number, or starting from an extended author field?


- Julian

Re: format of svn:author

Posted by Branko Čibej <br...@apache.org>.

On 04.01.2012 13:50, Mark Mielke wrote:
> Branko: If "svn log", "svn blame", and anything like TortoiseSVN or
> Subclipse were to support this, you might have a point. As it is,
> anybody with teams large enough such that the unique identifier is not
> recognizable (i.e. committer A immediately recognizes and knows that
> unique identifier for committer B) needs to FUDGE svn:author to
> include additional information which is not really part of the unique
> identifier at all and is only a humanly representable version of the
> unique identifier, and this leads to:
>
> 1) Breakage in other tools. Committer mappings don't work.
> 2) The unique identifier is now not correct as it includes non-unique,
> non-permanent details that change.

I understand all this, but how do you propose that, e.g., "svn blame"
would guess /which/ of the alternative identification tokens it's
supposed to show? If you don't want to always show the unique ID, then
obviously you'd choose one of the alternatives based on ... what? The
identity of the invoker of the command? Some other criterion? Things
quickly become horribly hairy.

Without specific use cases and examples, it's hard to come up with any
kind of coherent identification scheme that's different from what we
have now.

-- Brane

Re: format of svn:author

Posted by Mark Mielke <ma...@mark.mielke.cc>.

Branko: If "svn log", "svn blame", and anything like TortoiseSVN or 
Subclipse were to support this, you might have a point. As it is, 
anybody with teams large enough such that the unique identifier is not 
recognizable (i.e. committer A immediately recognizes and knows that 
unique identifier for committer B) needs to FUDGE svn:author to include 
additional information which is not really part of the unique identifier 
at all and is only a humanly representable version of the unique 
identifier, and this leads to:

1) Breakage in other tools. Committer mappings don't work.
2) The unique identifier is now not correct as it includes non-unique, 
non-permanent details that change.

But now I'm repeating myself. I think the problem here is that people 
are theorizing about subjects in which they have not had to deal with 
real-life problems. Theory vs practice. You can say that Enterprise 
likes to repeat things - but I'm not sure you understand what Enterprise 
is doing... it just looks like repeating to you, and so you assume it 
has no purpose, and therefore no merit.

On 01/04/2012 05:35 AM, Branko Čibej wrote:
> On 04.01.2012 11:09, Vincent Lefevre wrote:
>> On 2012-01-03 15:44:47 +0100, Branko Čibej wrote:
>>> I think this whole thread is slightly bogus. It should be obvious that
>>> whatever is in the svn:author field had better be a unique identifier of
>>> the person responsible for the commit, regardless of how it gets there.
>> I'd say that this choice should entirely be made by the administrator
>> of the repository.
> Exactly. And we give that choice, at least for Apache-embedded servers
> (which is what enterprises will use, I hope).
>
> If we, say, added another property where admins could write a whole
> other set of information, we'd either have to define the format (and
> incidentally tee off the 90% who want a different format), or leave the
> contents up to the administrator (and tee off the other 90% who want
> compatibility across diverse installations).
>
> I still don't understand why it's so hard for other tools to, e.g., look
> up the svn:author unique ID on an LDAP server somewhere. Otherwise we're
> effectively duplicating (a small part of) any of the "standard"
> directory services.
>
> (Yeah, I know that "enterprise" tools like to duplicate functionality
> and mess up open standards while they're at it, but I don't see why we
> should be doing the same.)
>
> -- Brane

-- 
Mark Mielke<ma...@mielke.cc>

Re: format of svn:author

Posted by Branko Čibej <br...@apache.org>.

On 04.01.2012 11:09, Vincent Lefevre wrote:
> On 2012-01-03 15:44:47 +0100, Branko Čibej wrote:
>> I think this whole thread is slightly bogus. It should be obvious that
>> whatever is in the svn:author field had better be a unique identifier of
>> the person responsible for the commit, regardless of how it gets there.
> I'd say that this choice should entirely be made by the administrator
> of the repository.

Exactly. And we give that choice, at least for Apache-embedded servers
(which is what enterprises will use, I hope).

If we, say, added another property where admins could write a whole
other set of information, we'd either have to define the format (and
incidentally tee off the 90% who want a different format), or leave the
contents up to the administrator (and tee off the other 90% who want
compatibility across diverse installations).

I still don't understand why it's so hard for other tools to, e.g., look
up the svn:author unique ID on an LDAP server somewhere. Otherwise we're
effectively duplicating (a small part of) any of the "standard"
directory services.

(Yeah, I know that "enterprise" tools like to duplicate functionality
and mess up open standards while they're at it, but I don't see why we
should be doing the same.)

-- Brane

Re: format of svn:author

Posted by Vincent Lefevre <vi...@vinc17.net>.

On 2012-01-03 15:44:47 +0100, Branko Čibej wrote:
> I think this whole thread is slightly bogus. It should be obvious that
> whatever is in the svn:author field had better be a unique identifier of
> the person responsible for the commit, regardless of how it gets there.

I'd say that this choice should entirely be made by the administrator
of the repository. For instance, for my personal repository, I am
the only person who commits (that's the definition of a personal
repository), so I choose to put in svn:author the machine (or
network) from which I do the commit.

-- 
Vincent Lefèvre <vi...@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / Arénaire project (LIP, ENS-Lyon)

Re: format of svn:author

Posted by Mark Mielke <ma...@mark.mielke.cc>.

On 01/02/2012 04:48 AM, Alan Barrett wrote:
> On Mon, 02 Jan 2012, Mark Mielke wrote:
>>> If your third party tools can't extract the unique ID from 
>>> svn:author = "Display Name <un...@domain>" then perhaps the 
>>> problem lies at least as much in your third party tools as in 
>>> subversion.
>>
>> I wonder if you thought this through before posting. :-)
>>
>> You are saying that if I make up an essentially arbitrary scheme, 
>> such as "Display Name <un...@domain>", and you have a tool which 
>> is unaware of my scheme, and therefore your tool fails to match 
>> users in the region because of my scheme - that your tool has the 
>> problem?
>
> It's a free text field, although it's probably a bad idea to put more 
> than one line of text there.  As the administrator who sets up the svn 
> repository, you are responsible for choosing what text you put in 
> svn:author.  If, as you said, you have tools that want to be able to 
> map it to a more restricted type, such as a login name, or employee 
> number, or (part of) an email address, then the tool is responsible 
> for performing the mapping.  If the tool can't perform the mapping 
> then, yes, I say that the tool is incompatible with the way the 
> repository administrator has chosen to use the svn:author field.

No. I don't control the hundreds of tools that support Subversion. The 
tools cannot be responsible for conventions they are unaware of. I think 
you are thinking of the tiny little scope where the only components in 
the system are Subversion itself and tools that I (or you) are directly 
responsible for and have the power to change. This is an extremely small 
view of the problem.

>> Otherwise, only extremely casual interpretation can be done of the 
>> field. For example, it can be treated as a unique identifier - but 
>> more like a "foreign key" unique identifier in the sense that it is a 
>> key in some domain, but not necessarily a domain I know about or am 
>> an authority for.
> As the administrator who sets up the svn repository, and the hooks 
> that edit or validate the data as it goes into the svn:author field, 
> you have absolute control over the data format, so it's not fair to 
> say that it's in a domain that you don't know about -- It's in a 
> domain that you choose.  Whatever format you choose, you should make 
> sure your other tools can deal with it.

Only in the extremely small view that I describe above. So not really 
relevant to the real requirements.

>
>> Our exact compromise for the last three years is:
>>
>> 1) original svn:author value arrives on the server as "1234567" - 
>> a corporate unique identifier
>> 2) pre-commit re-writes svn:author to "Full Name (<original 
>> svn:author value>)"
>> 3) pre-commit adds <company>:gid as "<original svn:author value>"
>>
>> Then as I mention - various other tools such as FishEye have explicit 
>> mappings from "Mark Mielke (1234567)" => "1234567" for each 
>> Subversion repository. We're primarily a ClearCase and Perforce shop 
>> right now - but even so, I have several Subversion repository 
>> mappings of this form. It works. It just sucks.
>
> If FishEye needs a huge mapping table from "text as it appears in 
> svn:author" to "unique id", with a row in the table for every possible 
> ID, then this process will be very painful for you; on the other hand, 
> if you could configure FishEye to extract the "1234567" from "Mark 
> Mielke (1234567)" using a regular expression or other string 
> manipulation technique, then it would be much more maintainable.
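
For reference, the kind of string manipulation suggested above could look like the following sketch.  The pattern assumes the "Full Name (id)" convention described earlier in this message; the function name is made up, and this is not a FishEye feature or configuration format.

```python
import re

# Matches "Full Name (id)"; the id is whatever sits in the final parentheses.
_AUTHOR_WITH_ID = re.compile(r"^(?P<name>.*\S)\s*\((?P<uid>[^()]+)\)\s*$")


def split_author(svn_author):
    """Return (display name, unique id) from an svn:author value.

    If the value does not follow the "Full Name (id)" convention, the
    whole string is treated as the unique id and the name is None.
    """
    match = _AUTHOR_WITH_ID.match(svn_author)
    if match is None:
        return None, svn_author
    return match.group("name"), match.group("uid")
```

Such a pattern only works for tools that let the user supply it, and only for repositories that follow this one convention.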

It is not reasonable for a Subversion user to customize every tool they 
use. It is far preferable for Subversion to provide the solution as a 
core function.

> I expect that changes on the subversion side could help (as you have 
> mentioned, adding more properties, or clearly documenting one or more 
> suggested ways of providing structure inside svn:author, or both), but 
> I still hold the opinion that your pain is caused at least as much by 
> FishEye as by svn.

More than help. It is the only true solution. Anything else - such as 
each Subversion user customizing their own tools - is entirely a hack.

-- 
Mark Mielke<ma...@mielke.cc>

Re: format of svn:author

Posted by Alan Barrett <ap...@cequrux.com>.

On Mon, 02 Jan 2012, Mark Mielke wrote:
>> If your third party tools can't extract the unique ID from 
>> svn:author = "Display Name <un...@domain>" then perhaps the 
>> problem lies at least as much in your third party tools as in 
>> subversion.
>
> I wonder if you thought this through before posting. :-)
>
> You are saying that if I make up an essentially arbitrary 
> scheme, such as "Display Name <un...@domain>", and you have 
> a tool which is unaware of my scheme, and therefore your tool 
> fails to match users in the region because of my scheme - that 
> your tool has the problem?

It's a free text field, although it's probably a bad idea to put 
more than one line of text there.  As the administrator who sets 
up the svn repository, you are responsible for choosing what text 
you put in svn:author.  If, as you said, you have tools that want 
to be able to map it to a more restricted type, such as a login 
name, or employee number, or (part of) an email address, then the 
tool is responsible for performing the mapping.  If the tool can't 
perform the mapping then, yes, I say that the tool is incompatible 
with the way the repository administrator has chosen to use the 
svn:author field.

> Otherwise, only extremely casual interpretation can be done of 
> the field. For example, it can be treated as a unique identifier 
> - but more like a "foreign key" unique identifier in the sense 
> that it is a key in some domain, but not necessarily a domain I 
> know about or am an authority for.

As the administrator who sets up the svn repository, and the hooks 
that edit or validate the data as it goes into the svn:author 
field, you have absolute control over the data format, so it's not 
fair to say that it's in a domain that you don't know about -- 
It's in a domain that you choose.  Whatever format you choose, you 
should make sure your other tools can deal with it.

> Our exact compromise for the last three years is:
>
> 1) original svn:author value arrives on the server as 
> "1234567" - a corporate unique identifier
> 2) pre-commit re-writes svn:author to "Full Name (<original 
> svn:author value>)"
> 3) pre-commit adds <company>:gid as "<original svn:author 
> value>"
>
> Then as I mention - various other tools such as FishEye have 
> explicit mappings from "Mark Mielke (1234567)" => "1234567" for 
> each Subversion repository. We're primarily a ClearCase and 
> Perforce shop right now - but even so, I have several Subversion 
> repository mappings of this form. It works. It just sucks.

If FishEye needs a huge mapping table from "text as it appears 
in svn:author" to "unique id", with a row in the table for every 
possible ID, then this process will be very painful for you; on 
the other hand, if you could configure FishEye to extract the 
"1234567" from "Mark Mielke (1234567)" using a regular expression 
or other string manipulation technique, then it would be much more 
maintainable.
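
As a rough illustration of the "regular expression or other string
manipulation technique" mentioned above, here is a minimal Python sketch.
It assumes the "Full Name (ID)" convention described in this thread; the
names AUTHOR_RE and split_author are made up for the example, and nothing
here reflects how FishEye is actually configured.

```python
import re

# Hypothetical pattern for the "Full Name (ID)" convention from this thread.
AUTHOR_RE = re.compile(r"^(?P<name>.*\S)\s*\((?P<uid>[^()]+)\)\s*$")

def split_author(author):
    """Return (display_name, unique_id) for an svn:author string.

    If the string does not follow the convention, treat the whole
    value as the identifier and return (None, author).
    """
    m = AUTHOR_RE.match(author)
    if m is None:
        return None, author
    return m.group("name"), m.group("uid")
```

A value that was never rewritten by the hook falls through unchanged, so
old and new revisions can share one code path.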

I expect that changes on the subversion side could help (as you 
have mentioned, adding more properties, or clearly documenting one 
or more suggested ways of providing structure inside svn:author, 
or both), but I still hold the opinion that your pain is caused at 
least as much by FishEye as by svn.

--apb (Alan Barrett)

Re: format of svn:author

Posted by Mark Mielke <ma...@mark.mielke.cc>.

FYI that "full name" or "email address" are not actually aspects of a 
unique identifier. People's names change, and email addresses change. 
The unique identifier should normally be much more persistent and should 
enable cross referencing with other tools and database reports. The name 
and email is for human consumption. The unique identifier is for machine 
consumption. Subversion has chosen to define only one attribute to 
represent both which makes it extremely difficult to get either. We're 
talking about customizing the software for dozens of open 
source and commercial products, just to make the "full name" visible to 
users. But without a standard or convention - we're talking about each 
organization defining their own standard or convention and providing 
their own customization to dozens of tools. This works against the open 
source community being leveraged to provide solutions which benefit the 
most people from a shared component.

In the below - Branko seems to suggest that because there is a lot of 
material on this "out there" and lots of choices, therefore it isn't the 
place for Subversion to step in and choose one to adopt. I suggest that 
the reason there is a wealth of material out there is because the 
subject is important and that the reason a standard is preferred is 
specifically because it allows integration between many tools from many 
providers.

On 01/03/2012 09:44 AM, Branko Čibej wrote:
> I think this whole thread is slightly bogus. It should be obvious that
> whatever is in the svn:author field had better be a unique identifier of
> the person responsible for the commit, regardless of how it gets there.
> Once that requirement is met, everything else is "simply" a matter of
> getting the repository administrator to set up that identifier in such a
> way that the tools used by the users of that repository can do something
> useful with it.
>
> I propose that this is /entirely/ in the domain of the organization that
> is maintaining the Subversion installation. There is no standard way of
> identifying all pertinent user information -- or rather, there are some
> 57 different standards. There's nothing stopping the repository
> administrator from writing a pre-commit hook that adds tailored revprops
> with identifiers that comply with all those 57 standards and any custom
> ones, too. Asking Subversion to add reserved revprop names for all
> possible (not even plausible) identification schemes would be a bit like
> asking to add a different boolean property for every known character
> encoding -- in other words, an explosion of reserved property names that
> would, in general, give no benefit to the vast majority of users.
>
> All that would happen is that different organizations would abuse those
> property names in different, incompatible ways.
>
> -- Brane

-- 
Mark Mielke<ma...@mielke.cc>

Re: format of svn:author

Posted by Mark Mielke <ma...@mark.mielke.cc>.

To be blunt - this is exactly why Subversion will stay small. When the 
main people on the developer list hold small world views such as "it is 
the responsibility of the organization that uses Subversion to customize 
the dozens of tools they integrate with in a non-standard way", it is 
guaranteed that Subversion adoption cannot go beyond a certain 
threshold. Which is fine. Sometimes you need small. It is simply not 
feasible for every organization to customize every tool they use. The 
thought itself is ridiculous.

But if this is truly the opinion, then my efforts here are wasted. Other 
solutions provide these capabilities out of box.

On 01/03/2012 09:44 AM, Branko Čibej wrote:
> On 03.01.2012 04:02, Stefan Fuhrmann wrote:
>> * What is an author?
>> * How do concepts like "account", "person",
>>    "role", "group" relate to that notion?
>> * What aspects of the above can be provided to /
>>    handled by Subversion in a portable way?
>> * What are typical use-cases and do they match
>>    with the definitions you use?
>>
>>>      svn:author =>  unique identifier
>> That seems to be the hardest to define and may
>> be difficult to provide. Identifies the person?
>> PGP key ID?
>>>      svn:author-name =>  Mark Mielke
>> That would denote the "person". How would
>> duplicates and name changes be handled?
>>>      svn:author-email =>  mark@mark.mielke.cc
>> That looks close to the "account" aspect.
> I think this whole thread is slightly bogus. It should be obvious that
> whatever is in the svn:author field had better be a unique identifier of
> the person responsible for the commit, regardless of how it gets there.
> Once that requirement is met, everything else is "simply" a matter of
> getting the repository administrator to set up that identifier in such a
> way that the tools used by the users of that repository can do something
> useful with it.
>
> I propose that this is /entirely/ in the domain of the organization that
> is maintaining the Subversion installation. There is no standard way of
> identifying all pertinent user information -- or rather, there are some
> 57 different standards. There's nothing stopping the repository
> administrator from writing a pre-commit hook that adds tailored revprops
> with identifiers that comply with all those 57 standards and any custom
> ones, too. Asking Subversion to add reserved revprop names for all
> possible (not even plausible) identification schemes would be a bit like
> asking to add a different boolean property for every known character
> encoding -- in other words, an explosion of reserved property names that
> would, in general, give no benefit to the vast majority of users.
>
> All that would happen is that different organizations would abuse those
> property names in different, incompatible ways.
>
> -- Brane


-- 
Mark Mielke<ma...@mielke.cc>

Re: format of svn:author

Posted by Branko Čibej <br...@apache.org>.

On 03.01.2012 04:02, Stefan Fuhrmann wrote:
> * What is an author?
> * How do concepts like "account", "person",
>   "role", "group" relate to that notion?
> * What aspects of the above can be provided to /
>   handled by Subversion in a portable way?
> * What are typical use-cases and do they match
>   with the definitions you use?
>
>>     svn:author => unique identifier
> That seems to be the hardest to define and may
> be difficult to provide. Identifies the person?
> PGP key ID?
>>     svn:author-name => Mark Mielke
> That would denote the "person". How would
> duplicates and name changes be handled?
>>     svn:author-email => mark@mark.mielke.cc
> That looks close to the "account" aspect.

I think this whole thread is slightly bogus. It should be obvious that
whatever is in the svn:author field had better be a unique identifier of
the person responsible for the commit, regardless of how it gets there.
Once that requirement is met, everything else is "simply" a matter of
getting the repository administrator to set up that identifier in such a
way that the tools used by the users of that repository can do something
useful with it.

I propose that this is /entirely/ in the domain of the organization that
is maintaining the Subversion installation. There is no standard way of
identifying all pertinent user information -- or rather, there are some
57 different standards. There's nothing stopping the repository
administrator from writing a pre-commit hook that adds tailored revprops
with identifiers that comply with all those 57 standards and any custom
ones, too. Asking Subversion to add reserved revprop names for all
possible (not even plausible) identification schemes would be a bit like
asking to add a different boolean property for every known character
encoding -- in other words, an explosion of reserved property names that
would, in general, give no benefit to the vast majority of users.

All that would happen is that different organizations would abuse those
property names in different, incompatible ways.

-- Brane

Re: format of svn:author

Posted by Stefan Fuhrmann <eq...@web.de>.

On 02.01.2012 09:34, Mark Mielke wrote:
> On 01/02/2012 02:52 AM, Alan Barrett wrote:
>> On Sun, 01 Jan 2012, Mark Mielke wrote:
>>>> Another idea is to change the revprop's value in the pre-commit or 
>>>> post-commit hook: [...]
>>>
>>> This is what we've been doing for about two years. It has the 
>>> consequence that tools don't automatically match unique identifier 
>>> to commit as they no longer match.
>>
>> If your third party tools can't extract the unique ID from svn:author 
>> = "Display Name <un...@domain>" then perhaps the problem lies at 
>> least as much in your third party tools as in subversion.
>
> I wonder if you thought this through before posting. :-)
>

Hi Mark,

I only loosely followed this thread but I still want
to throw my 2 cents in here.

It seems that you are looking for a formal way
(i.e. accessible to tools) to identify the "author"
of a change and you listed some problems that
you are facing with the current state of things.
To me, it seems that there actually *is* plenty room
for improvement and simply nobody really brought
the topic up until now.

If you are working towards a proposal (problem
description + proposed solution), make sure you
start your analysis at the very basic. Such as:

* What is an author?
* How do concepts like "account", "person",
   "role", "group" relate to that notion?
* What aspects of the above can be provided to /
   handled by Subversion in a portable way?
* What are typical use-cases and do they match
   with the definitions you use?

>     svn:author => unique identifier
That seems to be the hardest to define and may
be difficult to provide. Identifies the person?
PGP key ID?
>     svn:author-name => Mark Mielke
That would denote the "person". How would
duplicates and name changes be handled?
>     svn:author-email => mark@mark.mielke.cc
That looks close to the "account" aspect.

-- Stefan^2.

Re: format of svn:author

Posted by Mark Mielke <ma...@mark.mielke.cc>.

On 01/02/2012 02:52 AM, Alan Barrett wrote:
> On Sun, 01 Jan 2012, Mark Mielke wrote:
>>> Another idea is to change the revprop's value in the pre-commit or 
>>> post-commit hook: [...]
>>
>> This is what we've been doing for about two years. It has the 
>> consequence that tools don't automatically match unique identifier to 
>> commit as they no longer match.
>
> If your third party tools can't extract the unique ID from svn:author 
> = "Display Name <un...@domain>" then perhaps the problem lies at 
> least as much in your third party tools as in subversion.

I wonder if you thought this through before posting. :-)

You are saying that if I make up an essentially arbitrary scheme, such 
as "Display Name <un...@domain>", and you have a tool which is 
unaware of my scheme, and therefore your tool fails to match users in 
the region because of my scheme - that your tool has the problem? 
Despite the documentation for Subversion never mentioning or even 
suggesting a convention that you should be responsible for understanding?

No.

The convention must be defined in the Subversion book, and it must be 
part of the release notes so that third party tools adhere to the 
convention.

Otherwise, only extremely casual interpretation can be done of the 
field. For example, it can be treated as a unique identifier - but more 
like a "foreign key" unique identifier in the sense that it is a key in 
some domain, but not necessarily a domain I know about or am an 
authority for. This is why tools such as FishEye provide a "committer 
mapping" that is precisely this. It allows me to code on a 
per-repository basis each of the committer values that I want to 
associate with my own FishEye account. This is really horrible for 
dozens of repositories and thousands of users. Every user having to 
input their own mappings? Yuck, yuck, yuck.

If, instead, a convention was defined such that (and just hand waving 
here, I'm not really attached to these details):

     svn:author => unique identifier
     svn:author-name => Mark Mielke
     svn:author-email => mark@mark.mielke.cc

Then tools could make much more intelligent decisions on what to do or 
show. They could use svn:author as the mapping key, but show name and 
email in "svn log" or graphical browsers.
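
A minimal sketch of how a tool might consume such a convention, written
in Python. The property names svn:author-name and svn:author-email are
the hand-waved ones from the list above, not properties Subversion
defines, and describe_author is a made-up helper that takes the revision
properties as a plain dict.

```python
def describe_author(revprops):
    """Return (mapping_key, display_string) from a dict of revision properties.

    svn:author is used as the mapping key. The name and email
    properties are hypothetical and may be absent on older revisions.
    """
    key = revprops.get("svn:author", "")
    name = revprops.get("svn:author-name")    # hypothetical property
    email = revprops.get("svn:author-email")  # hypothetical property
    if name and email:
        display = f"{name} <{email}>"
    elif name:
        display = name
    else:
        # Nothing extra recorded: fall back to the raw identifier.
        display = key
    return key, display
```

Revisions committed before such a convention existed would still
display as they do today, since the fallback is the plain svn:author
value.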

The above model is a simple solution to the problem. More data stored 
for every commit. Data which can be used by downstream tools. This has a 
benefit in that the data is static which is sometimes good. In a large 
project, there is normally a turnover, and accounts that exist or are 
active in one year are not necessarily the same as the ones active in 
another year. By taking a snapshot of the data at the time of commit, it 
represents a permanent record of sorts. ClearCase is a system which does 
it this way. Event history records which track such things as object 
creation which is the closest map to svn:author have username, domain 
(NIS - old school), and fullname.

The other alternative is for a Subversion client to be able to lookup 
details for svn:author by asking the server using a published protocol. 
This model would allow the server to implement these queries 
transparently using LDAP lookups or similar depending on the 
requirements of the project. This stores less data for every commit, and 
allows for dynamic updates. It would allow for "Mark Mielke" to become 
"Mielke, Mark" with a server side configuration, but in contrast to the 
previous method, it would not allow for a snapshot of history to be taken. 
It would be a requirement that the identity management system used on 
the server would always have a record for me even after I am gone - or, 
alternatively, that the detail would become more vague over time. I 
disappear, and my account disappears - so it is left with only a unique 
identifier which might not be enough information.

In our particular case, we value all three of: 1) unique identifiers to 
be able to do cross referencing of reports between tools, 2) display of 
humanly readable names in output such as "svn log" or annotations in 
FishEye, ViewVC, Eclipse, or whatever tool the user is using, and 3) 
permanent historical record for auditing purposes.

Our exact compromise for the last three years is:

1) original svn:author value arrives on the server as "1234567" - a 
corporate unique identifier
2) pre-commit re-writes svn:author to "Full Name (<original svn:author 
value>)"
3) pre-commit adds <company>:gid as "<original svn:author value>"

Then as I mention - various other tools such as FishEye have explicit 
mappings from "Mark Mielke (1234567)" => "1234567" for each Subversion 
repository. We're primarily a ClearCase and Perforce shop right now - 
but even so, I have several Subversion repository mappings of this form. 
It works. It just sucks.
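
The three steps above can be sketched roughly as follows in Python. This
only shows the string handling, not a real hook: rewrite_author and the
full_names lookup table are invented for the example, and gid_prop
stands in for the "<company>:gid" property name, which the caller has to
supply.

```python
def rewrite_author(uid, full_names, gid_prop):
    """Return the revprops a pre-commit hook would set under this scheme.

    uid        -- the original svn:author value, e.g. "1234567"
    full_names -- hypothetical dict mapping uid -> full name
    gid_prop   -- property name standing in for "<company>:gid"
    """
    full_name = full_names.get(uid)
    if full_name is None:
        # Unknown identifier: leave svn:author as it arrived.
        return {"svn:author": uid, gid_prop: uid}
    return {"svn:author": f"{full_name} ({uid})", gid_prop: uid}
```

Keeping the raw identifier in its own property means a tool that knows
about that property never has to parse the display string.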

For svn:author to have structure - either internally using punctuation 
such as Unix gecos, or separated out as separate attributes - and for 
tools to all honour this structure - would be far more ideal. As 
Subversion is already well established, separate attributes is probably 
the best approach as it would enable forwards and backwards 
compatibility for uses of svn:author implemented by the Subversion code 
base itself. Tools that know how to access and do intelligent things 
with the new fields could feel free to do so. Users of tools that do not 
do intelligent things with the new fields could point to the 
Subversion release notes and Subversion book and say "this new attribute 
svn:author-name should be recognized by your tool", the change can make 
the tool roadmap, and we can all be happy.

-- 
Mark Mielke<ma...@mielke.cc>

format of svn:author

Posted by Alan Barrett <ap...@cequrux.com>.

On Sun, 01 Jan 2012, Mark Mielke wrote:
>> Another idea is to change the revprop's value in the pre-commit 
>> or post-commit hook: [...]
>
> This is what we've been doing for about two years. It has 
> the consequence that tools don't automatically match unique 
> identifier to commit as they no longer match.

If your third party tools can't extract the unique ID from 
svn:author = "Display Name <un...@domain>" then perhaps the 
problem lies at least as much in your third party tools as in 
subversion.

--apb (Alan Barrett)

Re: Problems with the documentation of Subversion dump format

Posted by Mark Mielke <ma...@mark.mielke.cc>.

On 12/31/2011 09:21 PM, Daniel Shahaf wrote:
> Mark Mielke wrote on Sat, Dec 31, 2011 at 01:00:12 -0500:
>> On 12/30/2011 09:35 PM, Daniel Shahaf wrote:
>>> AuthLDAPRemoteUserAttribute cn
>>>
>>> Then you can do
>>>
>>> % svn commit --username "Daniel Shahaf"
>>>
>>> and the logs will show
>>>
>>> ------------------------------------------------------------------------
>>> r1 | Daniel Shahaf | strftime(...) | 1 line
>>> ------------------------------------------------------------------------
>> We use this for a few services - but note how now instead of losing
>> the full name, it now loses the unique identifier. In a company of
>> 1,000+ people, there is a good chance for overlap of "cn". There
>> might be only one Mark Mielke, but other names such as John Sullivan
>> there could be many. The "cn" is not a unique identifier and cannot
>> be used to key off. It is for display purposes only.
>>
> Another idea is to change the revprop's value in the pre-commit or
> post-commit hook:
>      ..
>      author=`svnlook propget --revprop -t $TXN svn:author`
>      svnadmin setrevprop -t $TXN svn:author "`getent passwd $author | cut -d: -f5 | cut -d, -f1`<$a...@localdomain>"
>      ..
> and then people still authenticate with their uid's, but all existing
> tools will automatically show DVCS-style name+address author names.

This is what we've been doing for about two years. It has the 
consequence that tools don't automatically match unique identifier to 
commit as they no longer match.

> And if _that_ 's not good enough... what Stefan said: someone needs to
> sit down, define a problem, design a solution, and push it through.
> Perhaps it's as simple as defining a few new revprops?

Yes. This is what would be required to address this requirement permanently.

-- 
Mark Mielke<ma...@mielke.cc>

Re: Problems with the documentation of Subversion dump format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Mark Mielke wrote on Sat, Dec 31, 2011 at 01:00:12 -0500:
> On 12/30/2011 09:35 PM, Daniel Shahaf wrote:
> >Mark Mielke wrote on Fri, Dec 30, 2011 at 20:22:50 -0500:
> >
> >>I think you are not understanding my concern. If svn:author is only
> >>ever displayed to the user - then "authenticated username" may not
> >>be a desirable form to use. For teams of 10 people, sure you can
> >>recognize the uid of everybody in the team. But what about teams of
> >>100, or teams of 1000?
> >AuthLDAPRemoteUserAttribute cn
> >
> >Then you can do
> >
> >% svn commit --username "Daniel Shahaf"
> >
> >and the logs will show
> >
> >------------------------------------------------------------------------
> >r1 | Daniel Shahaf | strftime(...) | 1 line
> >------------------------------------------------------------------------
> 
> We use this for a few services - but note how now instead of losing
> the full name, it now loses the unique identifier. In a company of
> 1,000+ people, there is a good chance for overlap of "cn". There
> might be only one Mark Mielke, but other names such as John Sullivan
> there could be many. The "cn" is not a unique identifier and cannot
> be used to key off. It is for display purposes only.
> 

Another idea is to change the revprop's value in the pre-commit or
post-commit hook:
    ..
    author=`svnlook propget --revprop -t $TXN svn:author`
    svnadmin setrevprop -t $TXN svn:author "`getent passwd $author | cut -d: -f5 | cut -d, -f1` <$a...@localdomain>"
    ..
and then people still authenticate with their uid's, but all existing
tools will automatically show DVCS-style name+address author names.
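
For readers less used to the cut(1) pipeline above, here is roughly the
same full-name extraction as a small Python sketch. It works on one
passwd(5)-format line passed in as a string rather than calling getent,
and gecos_full_name is a made-up name for the example.

```python
def gecos_full_name(passwd_line):
    """Extract the full name from one passwd(5)-format line.

    Takes the fifth colon-separated field (the GECOS field) and keeps
    the part before its first comma.
    """
    fields = passwd_line.rstrip("\n").split(":")
    gecos = fields[4] if len(fields) > 4 else ""
    return gecos.split(",")[0]
```

An account with an empty GECOS field yields an empty string, so a real
hook would probably want to fall back to the uid in that case.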

And if _that_ 's not good enough... what Stefan said: someone needs to
sit down, define a problem, design a solution, and push it through.
Perhaps it's as simple as defining a few new revprops?

> All of this falls under the banner of thinking small. Small teams.
> Few requirements. Most products are like this. Sorry. I know you are
> just trying to help. :-)
> 
> -- 
> Mark Mielke<ma...@mielke.cc>
>

Re: Problems with the documentation of Subversion dump format

Posted by Mark Mielke <ma...@mark.mielke.cc>.

On 12/30/2011 09:35 PM, Daniel Shahaf wrote:
> Mark Mielke wrote on Fri, Dec 30, 2011 at 20:22:50 -0500:
>
>> I think you are not understanding my concern. If svn:author is only
>> ever displayed to the user - then "authenticated username" may not
>> be a desirable form to use. For teams of 10 people, sure you can
>> recognize the uid of everybody in the team. But what about teams of
>> 100, or teams of 1000?
> AuthLDAPRemoteUserAttribute cn
>
> Then you can do
>
> % svn commit --username "Daniel Shahaf"
>
> and the logs will show
>
> ------------------------------------------------------------------------
> r1 | Daniel Shahaf | strftime(...) | 1 line
> ------------------------------------------------------------------------

We use this for a few services - but note how now instead of losing the 
full name, it now loses the unique identifier. In a company of 1,000+ 
people, there is a good chance for overlap of "cn". There might be only 
one Mark Mielke, but other names such as John Sullivan there could be 
many. The "cn" is not a unique identifier and cannot be used to key off. 
It is for display purposes only.

All of this falls under the banner of thinking small. Small teams. Few 
requirements. Most products are like this. Sorry. I know you are just 
trying to help. :-)

-- 
Mark Mielke<ma...@mielke.cc>

Re: Problems with the documentation of Subversion dump format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Mark Mielke wrote on Fri, Dec 30, 2011 at 20:22:50 -0500:
> On 12/30/2011 02:36 PM, Daniel Shahaf wrote:
> >Mark Mielke wrote on Fri, Dec 30, 2011 at 14:24:25 -0500:
> >
> >>With the caveat being that tools that assume that svn:author was set by
> >>the Subversion API may no longer recognize the author...
> >"by the Subversion API" isn't the right phrase, but anyway: by default
> >the svn:author properties are set to the authenticated username (when
> >such is available) and cannot be changed unless the administrator runs
> >either 'svnadmin' or installs a pre-revprop-change hook.
> >
> >There's nothing preventing people from having, say, spaces in their
> >usernames.  So clients should cope with that.
> >
> >There is also nothing preventing people from having newlines in their
> >usernames... but now, the fact the library would allow this doesn't mean
> >it's a good idea to do this.
> 
> I think you are not understanding my concern. If svn:author is only
> ever displayed to the user - then "authenticated username" may not
> be a desirable form to use. For teams of 10 people, sure you can
> recognize the uid of everybody in the team. But what about teams of
> 100, or teams of 1000?

AuthLDAPRemoteUserAttribute cn

Then you can do

% svn commit --username "Daniel Shahaf" 

and the logs will show

------------------------------------------------------------------------
r1 | Daniel Shahaf | strftime(...) | 1 line
------------------------------------------------------------------------

Re: Problems with the documentation of Subversion dump format

Posted by Mark Mielke <ma...@mark.mielke.cc>.

On 12/30/2011 02:36 PM, Daniel Shahaf wrote:
> Mark Mielke wrote on Fri, Dec 30, 2011 at 14:24:25 -0500:
>
>> With the caveat being that tools that assume that svn:author was set by
>> the Subversion API may no longer recognize the author...
> "by the Subversion API" isn't the right phrase, but anyway: by default
> the svn:author properties are set to the authenticated username (when
> such is available) and cannot be changed unless the administrator runs
> either 'svnadmin' or installs a pre-revprop-change hook.
>
> There's nothing preventing people from having, say, spaces in their
> usernames.  So clients should cope with that.
>
> There is also nothing preventing people from having newlines in their
> usernames... but now, the fact the library would allow this doesn't mean
> it's a good idea to do this.

I think you are not understanding my concern. If svn:author is only ever 
displayed to the user - then "authenticated username" may not be a 
desirable form to use. For teams of 10 people, sure you can recognize 
the uid of everybody in the team. But what about teams of 100, or teams 
of 1000?

The Subversion documentation does not define any structure for this 
attribute, so tools simply assume that the Subversion API (or SDK 
or whatever you want to call it...) initialized it to the authenticated 
username. Some of them display this "as is". Some expand it using user 
databases such as LDAP. Everybody does it differently. This is a problem. 
If you use a tool that doesn't know about other databases (such as "svn 
log"), then it only shows the uid. If you work around this by including 
structure in svn:author, because there is no standard, chances are that 
no other tool will understand what you are doing, and it will break 
mappings. For example - in Crucible/FishEye - it won't know who to 
associate the commit with, so that when I login I can see my commit 
history. Instead, I have to manually define mappings.

In any case - this is just yet another example of how Subversion really 
doesn't scale. That it still can't properly merge across branches or 
renames is much more important...

>> We're still being hit by this, but choosing to take this hit. Our
>> identifiers are not easily legible ("1234567"), so we translate
>> svn:author to "Full Name (GID)" format during commit. Tools such as
>> Crucible/FishEye can sort of work with this via the committer mappings.
>>
>> Just mentioning as I think the "svn:author" being one arbitrary
>> byte-string with no specification defining how structure can be added in
>> a portable way that tools should understand and support means that it
>> really isn't that flexible. It is a compromise.
>>
>> In the case of GIT ->  SVN, the retaining of the information could be a
>> valid compromise. But it is still a loss of information, at least as far
>> as how tools such as Crucible/FishEye might interpret the content.
>>
>> -- 
>> Mark Mielke<ma...@mielke.cc>
>>

-- 
Mark Mielke<ma...@mielke.cc>

Re: Problems with the documentation of Subversion dump format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Mark Mielke wrote on Fri, Dec 30, 2011 at 14:24:25 -0500:
> On 12/29/2011 11:09 PM, Daniel Shahaf wrote:
>> Steinar Bang wrote on Thu, Dec 29, 2011 at 22:55:09 +0100:
>>> Couldn't you just store this information as custom properties?  Even
>>> though svn isn't able to use it, at least the information would be
>>> preserved...?
>>>
>> Re "throw away the domains": svn:author properties can be set to any
>> byte-string which is NUL-terminated UTF-8 that doesn't contain
>> CR (0x0D).
>
> With the caveat being that tools that assume that svn:author was set by  
> the Subversion API may no longer recognize the author...
>

"by the Subversion API" isn't the right phrase, but anyway: by default
the svn:author properties are set to the authenticated username (when
such is available) and cannot be changed unless the administrator runs
either 'svnadmin' or installs a pre-revprop-change hook.

There's nothing preventing people from having, say, spaces in their
usernames.  So clients should cope with that.

There is also nothing preventing people from having newlines in their
usernames... but now, the fact the library would allow this doesn't mean
it's a good idea to do this.
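
A small Python sketch of the constraints mentioned in this thread, in
case a client or hook author wants to check a value up front. It assumes
the rules are as stated here (valid UTF-8, no 0x0D byte) and treats
newlines as legal but discouraged; check_author is a made-up helper, not
part of any Subversion API.

```python
def check_author(value):
    """Return a list of problems with a candidate svn:author byte-string."""
    problems = []
    try:
        value.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("not valid UTF-8")
    if b"\r" in value:
        problems.append("contains CR (0x0D)")
    if b"\n" in value:
        # Permitted by the library, per the discussion, but a bad idea.
        problems.append("contains a newline (allowed, but discouraged)")
    return problems
```

Spaces are deliberately not flagged, since usernames with spaces are
expected to work.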

</random thoughts>

> We're still being hit by this, but choosing to take this hit. Our  
> identifiers are not easily legible ("1234567"), so we translate  
> svn:author to "Full Name (GID)" format during commit. Tools such as  
> Crucible/FishEye can sort of work with this via the committer mappings.
>
> Just mentioning as I think the "svn:author" being one arbitrary  
> byte-string with no specification defining how structure can be added in  
> a portable way that tools should understand and support means that it  
> really isn't that flexible. It is a compromise.
>
> In the case of GIT -> SVN, the retaining of the information could be a  
> valid compromise. But it is still a loss of information, at least as far  
> as how tools such as Crucible/FishEye might interpret the content.
>
> -- 
> Mark Mielke<ma...@mielke.cc>
>

Re: Problems with the documentation of Subversion dump format

Posted by Mark Mielke <ma...@mark.mielke.cc>.

On 12/29/2011 11:09 PM, Daniel Shahaf wrote:
> Steinar Bang wrote on Thu, Dec 29, 2011 at 22:55:09 +0100:
>> Couldn't you just store this information as custom properties?  Even
>> though svn isn't able to use it, at least the information would be
>> preserved...?
>>
> Re "throw away the domains": svn:author properties can be set to any
> byte-string which is NUL-terminated UTF-8 that doesn't contain CR
> (0x0D).

With the caveat being that tools that assume that svn:author was set by 
the Subversion API may no longer recognize the author...

We're still being hit by this, but choosing to take this hit. Our 
identifiers are not easily legible ("1234567"), so we translate 
svn:author to "Full Name (GID)" format during commit. Tools such as 
Crucible/FishEye can sort of work with this via the committer mappings.

Just mentioning as I think the "svn:author" being one arbitrary 
byte-string with no specification defining how structure can be added in 
a portable way that tools should understand and support means that it 
really isn't that flexible. It is a compromise.

In the case of GIT -> SVN, the retaining of the information could be a 
valid compromise. But it is still a loss of information, at least as far 
as how tools such as Crucible/FishEye might interpret the content.

-- 
Mark Mielke<ma...@mielke.cc>

Re: Problems with the documentation of Subversion dump format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Steinar Bang wrote on Thu, Dec 29, 2011 at 22:55:09 +0100:
> >>>>> "Eric S. Raymond" <es...@thyrsus.com>:
> 
> > On the other hand, when you *start* in gitspace, mapping back down to
> > the set of abstractions svn can handle is really lossy.  You have to
> > throw away the domains on committer names, all the author fields, real
> > (annotated) tags, and branch merges.
> 
> Couldn't you just store this information as custom properties?  Even
> though svn isn't able to use it, at least the information would be
> preserved...? 
> 

Re "throw away the domains": svn:author properties can be set to any
byte-string which is NUL-terminated UTF-8 that doesn't contain CR
(0x0D).
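
A minimal sketch of that constraint, for anyone who wants to check a
candidate author string before writing it.  The function name is
hypothetical (it is not part of any Subversion API), and reading
"NUL-terminated" as "no embedded NUL byte" is an assumption on my part:

```python
def is_acceptable_author(value: bytes) -> bool:
    """Return True if value is valid UTF-8 with no embedded NUL or CR.

    Hypothetical helper: it mirrors the rule stated above, not any
    actual check performed by the Subversion libraries.
    """
    # A NUL-terminated string cannot contain NUL; CR (0x0D) is excluded.
    if b"\x00" in value or b"\r" in value:
        return False
    try:
        value.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Under this reading a value such as "Full Name (1234567)" passes, and so
does a value containing a space or a domain part.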

Re: Problems with the documentation of Subversion dump format

Posted by Steinar Bang <sb...@dod.no>.

>>>>> "Eric S. Raymond" <es...@thyrsus.com>:

> On the other hand, when you *start* in gitspace, mapping back down to
> the set of abstractions svn can handle is really lossy.  You have to
> throw away the domains on committer names, all the author fields, real
> (annotated) tags, and branch merges.

Couldn't you just store this information as custom properties?  Even
though svn isn't able to use it, at least the information would be
preserved...?

Re: Problems with the documentation of Subversion dump format

Posted by "Eric S. Raymond" <es...@thyrsus.com>.

Daniel Shahaf <d....@daniel.shahaf.name>:
> Eric S. Raymond wrote on Tue, Dec 13, 2011 at 13:34:03 -0500:
> > Self-defense, I assure you.  I'm attempting to build a better SVN-to-DVCS
> > converter than exists anywhere now, and the best way to understand the
> > dump format well enough to do that is to document it in detail.
> 
> Curious if you also intend to support X-to-SVN conversion in your tool.

Probably not, though I have considered it and might change my mind.

Here's what's going on.  I've explained reposurgeon here before; to
recap, it uses the fact that lots of DVCSes can speak git import
streams to act as a common editor for all of them.  Essentially, it
says to the VCS's exporter "give me a stream dump", deserializes the
result, you edit, then it re-serializes the new state and feeds it to
an importer.

This works nicely for git and hg; a little less well, though
acceptably, for bzr.  bzr's problem is that it's confused 
about whether its unit of work is a whole repo or a sort of
detached branch thingy; its importer thinks one thing, its
exporter thinks another, and some irritations ensue.

I've wanted to teach reposurgeon to speak svn for a while now.  The
problem is, I've looked at a half-dozen exporters from svn to git
import streams, and they all *suck*.  It seems like everybody gets to
about the same point - just before doing the analysis to map svn
branches to git/hg-like branches - and gives up.  And you can't do
that - in gitspace, if you don't have a theory of what branches are,
you can't get the parent/child relationships right. In svnspace, you
can't even detect tags properly.

Most of these tools (David Barr's svn-fe, Gustavo Niemeyer's
svn2git, Chris Lee's svn-fast-export.py, a couple random svn2gits on
github and gitorious, some others I've forgotten now) *only work on
linear repos* - they've all got shamefaced comments saying branches
aren't handled yet.

There are only two exceptions I know of to the lossage. One has you
writing complicated rules in a minilanguage to define the branch
mapping: equal lossage, other direction.  The other is git-svn, which
does a reasonable job on repositories close to standard layout if you
hint at it right, but is really designed for live gatewaying rather than
conversions. Among other things it doesn't lift tags.

When I first shipped reposurgeon, I prodded this list to solve the
problem - have an official exporter.  That didn't happen, so I decided
to solve the problem from my end.  Wrote a zero-configuration
branch-mapping algorithm that should work for 99% of cases and punts
to something usable on the other 1%.  Got it to lift svn tags to real
git tags. I'm ahead of the pack already.  

The only reason I haven't shipped yet is that some weird things
cvs2svn generates give my dumpfile importer indigestion; I'm working
on that, it's the exact reason I need to understand the format
*completely*.

(I should mention here that I tried a different approach first -
before I wrote the dumpfile parser I was scraping svn repos with a
harness wrapped around the Subversion CLI tools, sort of a replay
attack. Had to abandon that because it was *hideously* slow - over 8
hours to suck in a repo with around 3K commits.  And yes, that was the
CLI tools being poky; the stream parser takes about 8 minutes on the
same repo.)

There is really only one even moderately hard problem here, and that
is the branch mapping.  Once you beat that, up-conversion from svn to
a DVCS works very nicely.  You're adding information, not losing it.
(There's one minor exception; Subversion's user-set properties don't
map well to plain git-import streams. You need the bzr properties
extension for that, which git itself chokes on.)

On the other hand, when you *start* in gitspace, mapping back down to
the set of abstractions svn can handle is really lossy.  You have to
throw away the domains on committer names, all the author fields, real
(annotated) tags, and branch merges. DVCS merges don't really map to
Subversion merges at all well; the svn version is more like what git/hg
folks call cherry-picking.

So, if I were to support writing svn dumpfiles, it would throw away so
much information from import streams that the result would be pathetic.
Functionally, the worst loss would be real branch merges.  That is 
a showstopper, right there.

There's only one use case for which the capability to write svn repos from
reposurgeon would make sense, and it's not conversions.  It's
Subversion-to-Subversion repository editing.  Which does tempt me a
little.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

Re: Problems with the documentation of Subversion dump format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Eric S. Raymond wrote on Tue, Dec 13, 2011 at 13:34:03 -0500:
> Self-defense, I assure you.  I'm attempting to build a better SVN-to-DVCS
> converter than exists anywhere now, and the best way to understand the
> dump format well enough to do that is to document it in detail.

Curious if you also intend to support X-to-SVN conversion in your tool.

Re: Problems with the documentation of Subversion dump format

Posted by "Eric S. Raymond" <es...@thyrsus.com>.

Greg Stein <gs...@gmail.com>:
> >                                                    I have commit
> > privileges on the Subversion repo; I was given them in connection
> > with svncutter.  I'm willing to fix up that file, but want to check
> > that I wouldn't be stepping on any toes by doing so.
> 
> Process-wise, to commit to areas outside of svncutter, just post the patch
> to the dev list, and when you get at least one +1 from a full committer,
> then you can go ahead with your commit of that patch.

Thanks, that's the exact advice I needed.

> Thanks for the detailed review!

Self-defense, I assure you.  I'm attempting to build a better SVN-to-DVCS
converter than exists anywhere now, and the best way to understand the
dump format well enough to do that is to document it in detail.

(Well, that's the way *my* mind works, anyway.  YMMV.)
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

Re: Problems with the documentation of Subversion dump format

Posted by Greg Stein <gs...@gmail.com>.

On Dec 13, 2011 2:14 AM, "Eric S. Raymond" <es...@thyrsus.com> wrote:
>
> I have just finished writing a full parser for Subversion dumpfiles.
> The next release of reposurgeon will have the ability to read them
> directly, though not to write them.
>
> In the process, I've looked very closely at the file
>
>    https://svn.apache.org/repos/asf/subversion/trunk/notes/dump-load-format.txt
>
> and discovered  a number of problems with it.  I have commit
> privileges on the Subversion repo; I was given them in connection
> with svncutter.  I'm willing to fix up that file, but want to check
> that I wouldn't be stepping on any toes by doing so.

Process-wise, to commit to areas outside of svncutter, just post the patch
to the dev list, and when you get at least one +1 from a full committer,
then you can go ahead with your commit of that patch.

Thanks for the detailed review!

Cheers,
-g

Re: Problems with the documentation of Subversion dump format

Posted by Stephen Butler <sb...@elego.de>.

On Dec 13, 2011, at 16:37 , Daniel Shahaf wrote:

> Eric S. Raymond wrote on Tue, Dec 13, 2011 at 09:44:59 -0500:
>> Yes, but in a single copy command?  My experience is that every one copy 
>> operation done from the CLI triggers a commit.
> 
> Not every single 'svn cp' invocation triggers a commit.

i.e., you could run 'svn cp' several times in a working copy, and then
commit once with 'svn ci'.

Steve
--
Stephen Butler | Senior Consultant
elego Software Solutions GmbH
Gustav-Meyer-Allee 25 | 13355 Berlin | Germany
tel: +49 30 2345 8696 | mobile: +49 163 25 45 015
fax: +49 30 2345 8695 | http://www.elegosoft.com
Geschäftsführer: Olaf Wagner | Sitz der Gesellschaft: Berlin
Amtsgericht Charlottenburg HRB 77719 | USt-IdNr: DE163214194

Re: Problems with the documentation of Subversion dump format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Eric S. Raymond wrote on Tue, Dec 13, 2011 at 13:30:10 -0500:
> Daniel Shahaf <d....@daniel.shahaf.name>:
> > Eric S. Raymond wrote on Tue, Dec 13, 2011 at 09:44:59 -0500:
> > > Yes, but in a single copy command?  My experience is that every one copy 
> > > operation done from the CLI triggers a commit.
> > 
> > Not every single 'svn cp' invocation triggers a commit.
> 
> I've never been able to stop it.  What's the actual trigger condition?
> 

'svn cp' commits iff the last target argument is a URL.
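
A rough sketch of that rule, in case it helps the documentation.  The
function name is made up, and the "://" test is only a crude stand-in
for however the client really recognizes a URL:

```python
def cp_commits_immediately(targets):
    """Guess whether 'svn cp TARGETS...' would commit right away.

    Hypothetical helper: treats any argument containing '://' as a URL,
    which is an assumption, not the client's actual detection logic.
    """
    return bool(targets) and "://" in targets[-1]
```

With a working-copy path as the last argument the copy is only
scheduled, which is why several 'svn cp' runs can share one commit.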

> > And, by the way, dumpfiles need to represent things that cannot be done
> > with plain 'svn'.
> 
> Yeah, I figured that out already from looking at some of the weird stuff
> cvs2svn committed in the back history of NUT-UPS.  This is a point that 
> needs to be clearer in the documentation.

The documentation should say that dumpfiles must be able to represent
everything the public svn_fs.h API allows.  (Or, at least, that subset
of that API that svn_repos.h also exposes... if they differ?)

Re: Problems with the documentation of Subversion dump format

Posted by "Eric S. Raymond" <es...@thyrsus.com>.

Daniel Shahaf <d....@daniel.shahaf.name>:
> Eric S. Raymond wrote on Tue, Dec 13, 2011 at 09:44:59 -0500:
> > Yes, but in a single copy command?  My experience is that every one copy 
> > operation done from the CLI triggers a commit.
> 
> Not every single 'svn cp' invocation triggers a commit.

I've never been able to stop it.  What's the actual trigger condition?

> And, by the way, dumpfiles need to represent things that cannot be done
> with plain 'svn'.

Yeah, I figured that out already from looking at some of the weird stuff
cvs2svn committed in the back history of NUT-UPS.  This is a point that 
needs to be clearer in the documentation.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

Re: Problems with the documentation of Subversion dump format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Eric S. Raymond wrote on Tue, Dec 13, 2011 at 09:44:59 -0500:
> Yes, but in a single copy command?  My experience is that every one copy 
> operation done from the CLI triggers a commit.

Not every single 'svn cp' invocation triggers a commit.

And, by the way, dumpfiles need to represent things that cannot be done
with plain 'svn'.

Re: Problems with the documentation of Subversion dump format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

On Wed, Dec 14, 2011, at 06:36, Eric S. Raymond wrote:
> Daniel Shahaf <d....@daniel.shahaf.name>:
> > Replace requires a node to already exist at the target path.
> > 
> > Add requires a node to not already exist.
> 
> OK, when you say "require", what do you mean?  Just that if these conditions
> fail the node should not be modified?
> 

If these conditions fail I believe 'svnadmin load' aborts without
committing the revision they appear in (but with prior revisions having
already been committed).
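
To make the "I believe" above concrete, here is a small model of that
behaviour.  It is a sketch under stated assumptions, not the real
loader: all names are hypothetical, and only "add" and "replace" are
modelled.

```python
class LoadError(Exception):
    """Raised when a node action's precondition does not hold."""


def check_action(existing, action, path):
    """Check the add/replace preconditions against the set of live paths."""
    if action not in ("add", "replace"):
        raise ValueError("only add and replace are modelled here")
    if action == "add" and path in existing:
        raise LoadError("add: %s already exists" % path)
    if action == "replace" and path not in existing:
        raise LoadError("replace: %s does not exist" % path)


def load(revisions):
    """Apply revisions in order; stop at the first failing revision.

    revisions is a list of lists of (action, path) tuples.  Returns
    (number of revisions committed, set of paths after the last commit).
    """
    committed = set()
    count = 0
    for actions in revisions:
        working = set(committed)  # the failing revision is never committed
        try:
            for action, path in actions:
                check_action(working, action, path)
                working.add(path)  # add and replace both leave the path live
        except LoadError:
            break
        committed = working
        count += 1
    return count, committed
```

Revisions before the failing one stay committed, the failing one
leaves no trace, and nothing after it is attempted.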

> > (container == node)
> 
> I see what you mean, but I think this is misleading terminology. The thing 
> you are calling a "container" is a history containing a sequence of actions
> each one of which is *described* by a node.
> 

Sorry, no.  I said "node"; what you call a "node" is termed
"node-revision" in the FS.

See subversion/libsvn_fs_base/notes/

> IMO the term "container" isn't a win either; it doesn't convey a
> strong enough sense of continuity over time.
> 
> In the draft I'm working on, I've used the term "flow", and defined a 
> flow as a sequence of actions on a path.

Re: Problems with the documentation of Subversion dump format

Posted by "Eric S. Raymond" <es...@thyrsus.com>.

Daniel Shahaf <d....@daniel.shahaf.name>:
> Replace requires a node to already exist at the target path.
> 
> Add requires a node to not already exist.

OK, when you say "require", what do you mean?  Just that if these conditions
fail the node should not be modified?

> (container == node)

I see what you mean, but I think this is misleading terminology. The thing 
you are calling a "container" is a history containing a sequence of actions
each one of which is *described* by a node.

IMO the term "container" isn't a win either; it doesn't convey a
strong enough sense of continuity over time.

In the draft I'm working on, I've used the term "flow", and defined a 
flow as a sequence of actions on a path.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

Re: Problems with the documentation of Subversion dump format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Eric S. Raymond wrote on Tue, Dec 13, 2011 at 13:25:13 -0500:
> C. Michael Pilato <cm...@collab.net>:
> > > Does a file replace differ in any way from a delete plus add of the new text?
> > 
> > In Subversion, yes.  A replacement is, like an add or a delete, an operation
> > at the node level, not an operation on the contents of that node.  A replace
> > is an addition of a new object[1] -- with its own new line of version
> > control history -- that is coincidental with the removal of some previously
> > existing object that occupied the same path.
> 
> I still don't understand how this differs from a delete followed by an add.
> Explain it to me like I'm reallllyyy stuuupid, please, so I can document it
> and you never have to explain it again.
> 
> When I add a file at a given path, it creates new object with a
> history that is tracked.  When I delete that path, I destroy the 
> container as well as the content.  If I subsequently create a new
> file at the same path, it's a new object with its own history.
> 
> How is a replace different?
> 

Presumably:

Replace requires a node to already exist at the target path.

Add requires a node to not already exist.

For comparison, the svn_delta_editor_t documentation talks about the
relative order delete_entry()/add_file()/add_dir() of the same path in
the same editor drive.

> > [1] Most of the time.  A replacement can have a copyfrom source, in which
> > case it's not strictly a new line of history for that object.
> 
> I think I get this part.  When you replace with a copy source, you're
> destroying the container that existed at this path, and replacing it with 
> a new container that has history extending back through the copy source.  
> Is that correct?
> 

Yes

(container == node)

Re: Problems with the documentation of Subversion dump format

Posted by "C. Michael Pilato" <cm...@collab.net>.

On 12/14/2011 09:47 AM, Eric S. Raymond wrote:
> Heh.  Just to add to the confusion, Daniel says that what I'm calling a 
> "flow" is elsewhere called a "node" and that what I'm calling a "node" 
> is elsewhere called a "node-revision".
> 
> I'm not sure how I want to deal with this in the 0.3 draft.  The
> problem is that what he's telling me is the correct terminology leaves
> the term "node" pretty overloaded.  Best thing may be to change "node"
> in the document to "node-revision" and leave in "flow" with a note
> indicating that the source code sometimes calls this a "node".

Yup.

A "node-revision" is, in svn-fs-speak, a single node in the giant DAG which
describes the whole of the version history in the repository.  It represents
the state of a file or directory as it existed at a given point in version time.

A "node", though, is not a node in that DAG at all, but the term used to
describe a whole set of those nodes (aka "node-revisions") connected in the
DAG by their ancestor IDs.  A "node" describes the line(s) of history of a
single versioned object (file, or directory, including all states thereof
including those resulting from copy operations).

So, yeah, one node is a "node-revision", and a collection of
"node-revisions" is a "node".  We probably could have done a bit better when
naming this stuff...
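
For the documentation's sake, the distinction can be sketched as a data
structure.  This is illustrative only: the class and field names are
invented, and it follows a single predecessor link, so it shows one
line of history rather than the full set of lines a copy can create.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeRevision:
    """One vertex in the DAG: a file or directory at one point in time."""
    path: str
    revision: int
    predecessor: Optional["NodeRevision"] = None  # ancestor link, if any


def line_of_history(tip: NodeRevision) -> List[NodeRevision]:
    """Collect the node-revisions reachable from tip via ancestor links.

    The returned collection is what the FS would call (part of) a "node".
    """
    chain = []
    current = tip
    while current is not None:
        chain.append(current)
        current = current.predecessor
    return chain
```

A copy shows up here as a node-revision whose predecessor lives at a
different path.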

Fortunately, there is no confusion about one thing:  the term "flow" has not
a single meaning at all in Subversion-speak.  :-)

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand

Re: Problems with the documentation of Subversion dump format

Posted by "Eric S. Raymond" <es...@thyrsus.com>.

C. Michael Pilato <cm...@collab.net>:
> On 12/14/2011 07:11 AM, Eric S. Raymond wrote:
> > Which brings up a question: should a delete on a non-empty directory succeed
> > or fail?
> 
> Succeed.

Thank you, that will go into the 0.3 draft.

> > IMO, part of the reason this stuff is confusing is that your
> > terminology is inadequate; see previous note to Daniel Shahaf. I get
> > what you mean by "container" but I think that label confuses more than
> > it enlightens.  In the draft I'm using the term "flow" for a sequence
> > of actions on a path.
> 
> I agree about the terminology.  I was trying (and failing, it seems) to
> adapt to your own terminology.  What a tangled web we weave...

Heh.  Just to add to the confusion, Daniel says that what I'm calling a 
"flow" is elsewhere called a "node" and that what I'm calling a "node" 
is elsewhere called a "node-revision".

I'm not sure how I want to deal with this in the 0.3 draft.  The
problem is that what he's telling me is the correct terminology leaves
the term "node" pretty overloaded.  Best thing may be to change "node"
in the document to "node-revision" and leave in "flow" with a note
indicating that the source code sometimes calls this a "node".

> The copyfrom attribute was designed to be valid only on "add" and "replace",
> not "change" (or "delete").  And yes, directory changes may only contain
> properties after the header block -- no text.

Good, that certainly simplifies life. Please have a look at the 0.2 draft
and see what if anything needs fixing.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

Re: Problems with the documentation of Subversion dump format

Posted by "C. Michael Pilato" <cm...@collab.net>.

On 12/14/2011 07:11 AM, Eric S. Raymond wrote:
> Which brings up a question: should a delete on a non-empty directory succeed
> or fail?

Succeed.

> IMO, part of the reason this stuff is confusing is that your
> terminology is inadequate; see previous note to Daniel Shahaf. I get
> what you mean by "container" but I think that label confuses more than
> it enlightens.  In the draft I'm using the term "flow" for a sequence
> of actions on a path.

I agree about the terminology.  I was trying (and failing, it seems) to
adapt to your own terminology.  What a tangled web we weave...

> I'm actually pretty sure this is all correct - but it leaves open the question
> of whether "change" can have copyfrom, and what that means in the case
> of directories.
>
> I checked, and "change" is what's used for a normal file content
> modification - see for example the change to bar/foo.c in sussman's 
> example in the notes file. 
> 
> There are a couple of different possibilities here.  One is that change with
> a copyfrom is illegal.  In that case, every directory change is required
> to be a property change, since directory nodes can't have text.  This is
> what my draft currently says.
> 
> Another possibility is that copyfrom does its history-attachment thing and
> the node is afterwards part of two flows.  That would be rather like a
> baby version of a DVCS merge.

The copyfrom attribute was designed to be valid only on "add" and "replace",
not "change" (or "delete").  And yes, directory changes may only contain
properties after the header block -- no text.
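
Those rules, together with the "delete stands alone" rule from the
summary quoted earlier in the thread, are simple enough to sketch as a
checker.  The function and parameter names are hypothetical, and this
is a reading of the thread, not the loader's actual validation:

```python
def node_record_problems(action, kind=None, has_text=False,
                         has_props=False, has_copyfrom=False):
    """Return a list of rule violations for one node record (empty if none)."""
    problems = []
    # copyfrom was designed for "add" and "replace" only.
    if has_copyfrom and action not in ("add", "replace"):
        problems.append("copyfrom is only valid on add and replace")
    # Directory records may carry properties but never text.
    if has_text and kind == "dir":
        problems.append("directories cannot carry text")
    # "delete" stands alone.
    if action == "delete" and (has_text or has_props):
        problems.append("delete carries no text and no properties")
    return problems
```

A dumpfile reader could run a check like this on each node record and
warn rather than abort, since real-world dumps may contain oddities.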

>>                This is still an addition of sorts in that the object is newly
>>    added to the set of its parent directory's list of children.
> 
> For what operations is this list of children significant, and how?  Which
> circles back to my first question about D on a directory.

Let's not go here.  I was trying to justify the use of the term "add" in the
context of a copied item (which, in one sense, isn't a new thing at all).
My attempt can only possibly create confusion, so I advise that we all just
forget I said it.  :-)

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand

Re: Problems with the documentation of Subversion dump format

Posted by "Eric S. Raymond" <es...@thyrsus.com>.

C. Michael Pilato <cm...@collab.net>:
>                                                              The "replace"
> action found in the dumpfile is just a compacting of some delete operation
> and a subsequent add or copy into a single verb, and that only because it
> helps sequential processors of the dump stream avoid possibly notifying
> about multiple actions on the same path

Thank you.  That is beautifully clear and I will use some variant of
it in the draft I am working up.

Which brings up a question: should a delete on a non-empty directory succeed
or fail?

If it should succeed, then R truly is D + A.  If it should fail, then R
lacks a precondition that D + A has.
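
If R really is just D + A, the compaction described above can be
sketched in a few lines.  This is an illustration with made-up names
that only merges a delete immediately followed by an add of the same
path; it is not how the dump writer is actually implemented:

```python
def compact_actions(actions):
    """Fold ("delete", p) directly followed by ("add", p) into ("replace", p).

    actions is a list of (action, path) tuples for a single revision.
    """
    result = []
    for action, path in actions:
        if action == "add" and result and result[-1] == ("delete", path):
            result[-1] = ("replace", path)
        else:
            result.append((action, path))
    return result
```

A sequential reader of the output then sees one action per path
instead of two.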

> (My prior response was the result of my misreading your phrase "delete plus
> add of the new text" as meaning "removing all the contents of the file, and
> then adding all new contents of the same file".  I see now that you were
> talking about "container" operations, not content ones.  Sorry about that.)

No problem.  Easy mistake to make.

> >> [1] Most of the time.  A replacement can have a copyfrom source, in which
> >> case it's not strictly a new line of history for that object.
> > 
> > I think I get this part.  When you replace with a copy source, you're
> > destroying the container that existed at this path, and replacing it with 
> > a new container that has history extending back through the copy source.  
> > Is that correct?
> 
> Yup!

IMO, part of the reason this stuff is confusing is that your
terminology is inadequate; see previous note to Daniel Shahaf. I get
what you mean by "container" but I think that label confuses more than
it enlightens.  In the draft I'm using the term "flow" for a sequence
of actions on a path.

> I was trying to think through the generalities here, too.  I believe they
> boil down to this:
> 
>    "delete" stands alone.  It never has text.  Never has properties.
>    Never has copyfrom.
> 
>    "add" and "replace" can have text if the added object is a file.  The
>    text is the contents of the added object as it appears in the committed
>    revision.  "add" and "replace" of directories can not have text.
> 
>    "add and replace" can have properties -- the set of properties present
>    on the added file/directory in the committed revision.
> 
>    "add and replace" can have copyfrom information, indicating that the
>    "added" object does not truly represent the creation of a new line of
>    history, but is instead a continuation of a pre-existing line of
>    history.  This is still an addition of sorts in that the object is newly
>    added to the set of its parent directory's list of children.
> 
> But I haven't double-thunk that for complete accuracy.

I'm actually pretty sure this is all correct - but it leaves open the question
of whether "change" can have copyfrom, and what that means in the case
of directories.

I checked, and "change" is what's used for a normal file content
modification - see for example the change to bar/foo.c in sussman's 
example in the notes file. 

There are a couple of different possibilities here.  One is that change with
a copyfrom is illegal.  In that case, every directory change is required
to be a property change, since directory nodes can't have text.  This is
what my draft currently says.

Another possibility is that copyfrom does its history-attachment thing and
the node is afterwards part of two flows.  That would be rather like a
baby version of a DVCS merge.

You wrote:

>                This is still an addition of sorts in that the object is newly
>    added to the set of its parent directory's list of children.

For what operations is this list of children significant, and how?  Which
circles back to my first question about D on a directory.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

Re: Problems with the documentation of Subversion dump format

Posted by Johan Corveleyn <jc...@gmail.com>.

On Tue, Dec 13, 2011 at 11:52 PM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> Johan Corveleyn wrote on Tue, Dec 13, 2011 at 22:04:33 +0100:
[...]
>> And even:
>>
>>   $ svn mv some/file.txt some/otherfile.txt
>>   $ svn mv some/otherfile.txt some/file.txt
>>   $ svn ci -m "Replace some/file.txt with a copy of itself."
>>
>> (pre-1.7 this would be a replace without copyfrom, breaking the line
>> of history [1], but that is fixed as of 1.7)
>>
>
> But the fix is client-side, right?  One can still do things like that by
> driving an RA commit editor directly.

Oh yes, that was purely a client-side problem, which was fixed by the
better capabilities of WC-NG. In theory the client could just as well
translate the above series of commands into a simple modification,
instead of a replace with history (copied from itself). But that
doesn't happen in practice :-). Anyway, I guess this is getting off
topic for this thread ...

-- 
Johan

>> [1] http://subversion.tigris.org/issues/show_bug.cgi?id=3429 - "svn mv
>> A B; svn mv B A" generates replace without history

Re: Problems with the documentation of Subversion dump format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

Johan Corveleyn wrote on Tue, Dec 13, 2011 at 22:04:33 +0100:
> On Tue, Dec 13, 2011 at 8:16 PM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> > C. Michael Pilato wrote on Tue, Dec 13, 2011 at 14:01:45 -0500:
> >> On 12/13/2011 01:25 PM, Eric S. Raymond wrote:
> >> > C. Michael Pilato <cm...@collab.net>:
> >> >>> Does a file replace differ in any way from a delete plus add of the new text?
> >> >>
> >> >> In Subversion, yes.  A replacement is, like an add or a delete, an operation
> >> >> at the node level, not an operation on the contents of that node.  A replace
> >> >> is an addition of a new object[1] -- with its own new line of version
> >> >> control history -- that is coincidental with the removal of some previously
> >> >> existing object that occupied the same path.
> >> >
> >> > I still don't understand how this differs from a delete followed by an add.
> >> > Explain it to me like I'm reallllyyy stuuupid, please, so I can document it
> >> > and you never have to explain it again.
> >> >
> >> > When I add a file at a given path, it creates new object with a
> >> > history that is tracked.  When I delete that path, I destroy the
> >> > container as well as the content.  If I subsequently create a new
> >> > file at the same path, it's a new object with its own history.
> >> >
> >> > How is a replace different?
> >>
> >> Assume your "delete" and subsequent "add" happens in the same commit, it's
> >> not different at all.  In fact, the Subversion filesystem API doesn't even
> >> recognize a "replace" operation.  There's "delete (file or dir)", there's
> >> "make file" and "make dir", and there's "copy (file or dir)".  The "replace"
> >> action found in the dumpfile is just a compacting of some delete operation
> >> and a subsequent add or copy into a single verb, and that only because it
> >> helps sequential processors of the dump stream avoid possibly notifying
> >> about multiple actions on the same path.  We favor the likes of:
> >>
> >>     R   /some/file.txt
> >>
> >> over:
> >>
> >>     D   /some/file.txt
> >>     A   /some/file.txt
> >>
> >> in output.
> >>
> >> (My prior response was the result of my misreading your phrase "delete plus
> >> add of the new text" as meaning "removing all the contents of the file, and
> >> then adding all new contents of the same file".  I see now that you were
> >> talking about "container" operations, not content ones.  Sorry about that.)
> >>
> >> >> [1] Most of the time.  A replacement can have a copyfrom source, in which
> >> >> case it's not strictly a new line of history for that object.
> >> >
> >> > I think I get this part.  When you replace with a copy source, you're
> >> > destroying the container that existed at this path, and replacing it with
> >> > a new container that has history extending back through the copy source.
> >> > Is that correct?
> >>
> >> Yup!
> >>
> >> I was trying to think through the generalities here, too.  I believe they
> >> boil down to this:
> >>
> >>    "delete" stands alone.  It never has text.  Never has properties.
> >>    Never has copyfrom.
> >>
> >>    "add" and "replace" can have text if the added object is a file.  The
> >>    text is the contents of the added object as it appears in the committed
> >>    revision.  "add" and "replace" of directories can not have text.
> >>
> >>    "add" and "replace" can have properties -- the set of properties present
> >>    on the added file/directory in the committed revision.
> >>
> >>    "add" and "replace" can have copyfrom information, indicating that the
> >>    "added" object does not truly represent the creation of a new line of
> >>    history, but is instead a continuation of a pre-existing line of
> >>    history.  This is still an addition of sorts in that the object is newly
> >>    added to the set of its parent directory's list of children.
> >>
> >> But I haven't double-thunk that for complete accuracy.
> >>
> >> > So, everything except a delete can include properties and they all
> >> > work the same way. Correct?
> >>
> >> Yes.
> >>
> >> >>> If a file replace can have a copyfrom source, how does replace with a
> >> >>> copyfrom source differ from add with a copyfrom source?
> >> >>
> >> >> They differ only in the fact that a replace implies the simultaneous deletion
> >> >> of some other object which previously lived at that path.
> >> >
> >> > Got it.  That case I understand, it's how they differ in the non-copyfrom
> >> > case that still confuses me.
> >>
> >> This is replace without copyfrom:
> >>
> >>    $ svn rm some/file.txt
> >>    $ touch some/file.txt
> >>    $ svn add some/file.txt
> >>    $ svn ci -m "Replace some/file.txt with a new file."
> >>
> >> This is replace with copyfrom:
> >>
> >>    $ svn rm some/file.txt
> >>    $ touch some/file.txt
> >>    $ svn copy someother/differentfile.txt some/file.txt
> >>    $ svn ci -m "Replace some/file.txt with a copy of a different file."
> >
> > And:
> >
> >   $ svn rm some/file.txt
> >   $ touch some/file.txt
> >   $ svn copy some/file.txt@HEAD some/file.txt
> >   $ svn ci -m "Replace some/file.txt with a copy of itself."
> 
> And even:
> 
>   $ svn mv some/file.txt some/otherfile.txt
>   $ svn mv some/otherfile.txt some/file.txt
>   $ svn ci -m "Replace some/file.txt with a copy of itself."
> 
> (pre-1.7 this would be a replace without copyfrom, breaking the line
> of history [1], but that is fixed as of 1.7)
> 

But the fix is client-side, right?  One can still do things like that by
driving an RA commit editor directly.

> [1] http://subversion.tigris.org/issues/show_bug.cgi?id=3429 - "svn mv
> A B; svn mv B A" generates replace without history
> 
> -- 
> Johan

Re: Problems with the documentation of Subversion dump format

Posted by Johan Corveleyn <jc...@gmail.com>.

On Tue, Dec 13, 2011 at 8:16 PM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:
> C. Michael Pilato wrote on Tue, Dec 13, 2011 at 14:01:45 -0500:
>> On 12/13/2011 01:25 PM, Eric S. Raymond wrote:
>> > C. Michael Pilato <cm...@collab.net>:
>> >>> Does a file replace differ in any way from a delete plus add of the new text?
>> >>
>> >> In Subversion, yes.  A replacement is, like an add or a delete, an operation
>> >> at the node level, not an operation on the contents of that node.  A replace
>> >> is an addition of a new object[1] -- with its own new line of version
>> >> control history -- that is coincidental with the removal of some previously
>> >> existing object that occupied the same path.
>> >
>> > I still don't understand how this differs from a delete followed by an add.
>> > Explain it to me like I'm reallllyyy stuuupid, please, so I can document it
>> > and you never have to explain it again.
>> >
>> > When I add a file at a given path, it creates a new object with a
>> > history that is tracked.  When I delete that path, I destroy the
>> > container as well as the content.  If I subsequently create a new
>> > file at the same path, it's a new object with its own history.
>> >
>> > How is a replace different?
>>
>> Assuming your "delete" and subsequent "add" happen in the same commit, it's
>> not different at all.  In fact, the Subversion filesystem API doesn't even
>> recognize a "replace" operation.  There's "delete (file or dir)", there's
>> "make file" and "make dir", and there's "copy (file or dir)".  The "replace"
>> action found in the dumpfile is just a compacting of some delete operation
>> and a subsequent add or copy into a single verb, and that only because it
>> helps sequential processors of the dump stream avoid possibly notifying
>> about multiple actions on the same path.  We favor the likes of:
>>
>>     R   /some/file.txt
>>
>> over:
>>
>>     D   /some/file.txt
>>     A   /some/file.txt
>>
>> in output.
>>
>> (My prior response was the result of my misreading your phrase "delete plus
>> add of the new text" as meaning "removing all the contents of the file, and
>> then adding all new contents of the same file".  I see now that you were
>> talking about "container" operations, not content ones.  Sorry about that.)
>>
>> >> [1] Most of the time.  A replacement can have a copyfrom source, in which
>> >> case it's not strictly a new line of history for that object.
>> >
>> > I think I get this part.  When you replace with a copy source, you're
>> > destroying the container that existed at this path, and replacing it with
>> > a new container that has history extending back through the copy source.
>> > Is that correct?
>>
>> Yup!
>>
>> I was trying to think through the generalities here, too.  I believe they
>> boil down to this:
>>
>>    "delete" stands alone.  It never has text.  Never has properties.
>>    Never has copyfrom.
>>
>>    "add" and "replace" can have text if the added object is a file.  The
>>    text is the contents of the added object as it appears in the committed
>>    revision.  "add" and "replace" of directories can not have text.
>>
>>    "add" and "replace" can have properties -- the set of properties present
>>    on the added file/directory in the committed revision.
>>
>>    "add" and "replace" can have copyfrom information, indicating that the
>>    "added" object does not truly represent the creation of a new line of
>>    history, but is instead a continuation of a pre-existing line of
>>    history.  This is still an addition of sorts in that the object is newly
>>    added to the set of its parent directory's list of children.
>>
>> But I haven't double-thunk that for complete accuracy.
>>
>> > So, everything except a delete can include properties and they all
>> > work the same way. Correct?
>>
>> Yes.
>>
>> >>> If a file replace can have a copyfrom source, how does replace with a
>> >>> copyfrom source differ from add with a copyfrom source?
>> >>
>> >> They differ only in the fact that a replace implies the simultaneous deletion
>> >> of some other object which previously lived at that path.
>> >
>> > Got it.  That case I understand, it's how they differ in the non-copyfrom
>> > case that still confuses me.
>>
>> This is replace without copyfrom:
>>
>>    $ svn rm some/file.txt
>>    $ touch some/file.txt
>>    $ svn add some/file.txt
>>    $ svn ci -m "Replace some/file.txt with a new file."
>>
>> This is replace with copyfrom:
>>
>>    $ svn rm some/file.txt
>>    $ touch some/file.txt
>>    $ svn copy someother/differentfile.txt some/file.txt
>>    $ svn ci -m "Replace some/file.txt with a copy of a different file."
>
> And:
>
>   $ svn rm some/file.txt
>   $ touch some/file.txt
>   $ svn copy some/file.txt@HEAD some/file.txt
>   $ svn ci -m "Replace some/file.txt with a copy of itself."

And even:

  $ svn mv some/file.txt some/otherfile.txt
  $ svn mv some/otherfile.txt some/file.txt
  $ svn ci -m "Replace some/file.txt with a copy of itself."

(pre-1.7 this would be a replace without copyfrom, breaking the line
of history [1], but that is fixed as of 1.7)

[1] http://subversion.tigris.org/issues/show_bug.cgi?id=3429 - "svn mv
A B; svn mv B A" generates replace without history

-- 
Johan

Re: Problems with the documentation of Subversion dump format

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.

C. Michael Pilato wrote on Tue, Dec 13, 2011 at 14:01:45 -0500:
> On 12/13/2011 01:25 PM, Eric S. Raymond wrote:
> > C. Michael Pilato <cm...@collab.net>:
> >>> Does a file replace differ in any way from a delete plus add of the new text?
> >>
> >> In Subversion, yes.  A replacement is, like an add or a delete, an operation
> >> at the node level, not an operation on the contents of that node.  A replace
> >> is an addition of a new object[1] -- with its own new line of version
> >> control history -- that is coincidental with the removal of some previously
> >> existing object that occupied the same path.
> > 
> > I still don't understand how this differs from a delete followed by an add.
> > Explain it to me like I'm reallllyyy stuuupid, please, so I can document it
> > and you never have to explain it again.
> > 
> > When I add a file at a given path, it creates a new object with a
> > history that is tracked.  When I delete that path, I destroy the 
> > container as well as the content.  If I subsequently create a new
> > file at the same path, it's a new object with its own history.
> > 
> > How is a replace different?
> 
> Assuming your "delete" and subsequent "add" happen in the same commit, it's
> not different at all.  In fact, the Subversion filesystem API doesn't even
> recognize a "replace" operation.  There's "delete (file or dir)", there's
> "make file" and "make dir", and there's "copy (file or dir)".  The "replace"
> action found in the dumpfile is just a compacting of some delete operation
> and a subsequent add or copy into a single verb, and that only because it
> helps sequential processors of the dump stream avoid possibly notifying
> about multiple actions on the same path.  We favor the likes of:
> 
>     R   /some/file.txt
> 
> over:
> 
>     D   /some/file.txt
>     A   /some/file.txt
> 
> in output.
> 
> (My prior response was the result of my misreading your phrase "delete plus
> add of the new text" as meaning "removing all the contents of the file, and
> then adding all new contents of the same file".  I see now that you were
> talking about "container" operations, not content ones.  Sorry about that.)
> 
> >> [1] Most of the time.  A replacement can have a copyfrom source, in which
> >> case it's not strictly a new line of history for that object.
> > 
> > I think I get this part.  When you replace with a copy source, you're
> > destroying the container that existed at this path, and replacing it with
> > a new container that has history extending back through the copy source.  
> > Is that correct?
> 
> Yup!
> 
> I was trying to think through the generalities here, too.  I believe they
> boil down to this:
> 
>    "delete" stands alone.  It never has text.  Never has properties.
>    Never has copyfrom.
> 
>    "add" and "replace" can have text if the added object is a file.  The
>    text is the contents of the added object as it appears in the committed
>    revision.  "add" and "replace" of directories can not have text.
> 
>    "add" and "replace" can have properties -- the set of properties present
>    on the added file/directory in the committed revision.
> 
>    "add" and "replace" can have copyfrom information, indicating that the
>    "added" object does not truly represent the creation of a new line of
>    history, but is instead a continuation of a pre-existing line of
>    history.  This is still an addition of sorts in that the object is newly
>    added to the set of its parent directory's list of children.
> 
> But I haven't double-thunk that for complete accuracy.
> 
> > So, everything except a delete can include properties and they all
> > work the same way. Correct?
> 
> Yes.
> 
> >>> If a file replace can have a copyfrom source, how does replace with a
> >>> copyfrom source differ from add with a copyfrom source? 
> >>
> >> They differ only in the fact that a replace implies the simultaneous deletion
> >> of some other object which previously lived at that path.
> > 
> > Got it.  That case I understand, it's how they differ in the non-copyfrom 
> > case that still confuses me.
> 
> This is replace without copyfrom:
> 
>    $ svn rm some/file.txt
>    $ touch some/file.txt
>    $ svn add some/file.txt
>    $ svn ci -m "Replace some/file.txt with a new file."
> 
> This is replace with copyfrom:
> 
>    $ svn rm some/file.txt
>    $ touch some/file.txt
>    $ svn copy someother/differentfile.txt some/file.txt
>    $ svn ci -m "Replace some/file.txt with a copy of a different file."

And:

   $ svn rm some/file.txt
   $ touch some/file.txt
   $ svn copy some/file.txt@HEAD some/file.txt
   $ svn ci -m "Replace some/file.txt with a copy of itself."

Re: Problems with the documentation of Subversion dump format

Posted by "C. Michael Pilato" <cm...@collab.net>.

On 12/13/2011 01:25 PM, Eric S. Raymond wrote:
> C. Michael Pilato <cm...@collab.net>:
>>> Does a file replace differ in any way from a delete plus add of the new text?
>>
>> In Subversion, yes.  A replacement is, like an add or a delete, an operation
>> at the node level, not an operation on the contents of that node.  A replace
>> is an addition of a new object[1] -- with its own new line of version
>> control history -- that is coincidental with the removal of some previously
>> existing object that occupied the same path.
> 
> I still don't understand how this differs from a delete followed by an add.
> Explain it to me like I'm reallllyyy stuuupid, please, so I can document it
> and you never have to explain it again.
> 
> When I add a file at a given path, it creates a new object with a
> history that is tracked.  When I delete that path, I destroy the 
> container as well as the content.  If I subsequently create a new
> file at the same path, it's a new object with its own history.
> 
> How is a replace different?

Assuming your "delete" and subsequent "add" happen in the same commit, it's
not different at all.  In fact, the Subversion filesystem API doesn't even
recognize a "replace" operation.  There's "delete (file or dir)", there's
"make file" and "make dir", and there's "copy (file or dir)".  The "replace"
action found in the dumpfile is just a compacting of some delete operation
and a subsequent add or copy into a single verb, and that only because it
helps sequential processors of the dump stream avoid possibly notifying
about multiple actions on the same path.  We favor the likes of:

    R   /some/file.txt

over:

    D   /some/file.txt
    A   /some/file.txt

in output.
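
To make the compaction concrete, here is a minimal Python sketch.  It is
not Subversion's actual code: the function name compact_actions and the
(verb, path) tuple layout are assumptions made up for illustration.  It
collapses a "D" followed later by an "A" on the same path into a single
"R", as described above.

```python
def compact_actions(actions):
    """Collapse a delete followed by an add on the same path into a replace.

    `actions` is a list of (verb, path) tuples in commit order, where verb
    is "D" or "A".  This layout is hypothetical, chosen for illustration.
    """
    result = []
    pending_delete = {}  # path -> index in result of a not-yet-matched "D"
    for verb, path in actions:
        if verb == "A" and path in pending_delete:
            # The path was deleted earlier in this commit: report one "R".
            result[pending_delete.pop(path)] = ("R", path)
        else:
            if verb == "D":
                pending_delete[path] = len(result)
            result.append((verb, path))
    return result


compacted = compact_actions([("D", "/some/file.txt"), ("A", "/some/file.txt")])
```

A sequential consumer of the compacted list then sees one notification
per path rather than two.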

(My prior response was the result of my misreading your phrase "delete plus
add of the new text" as meaning "removing all the contents of the file, and
then adding all new contents of the same file".  I see now that you were
talking about "container" operations, not content ones.  Sorry about that.)

>> [1] Most of the time.  A replacement can have a copyfrom source, in which
>> case it's not strictly a new line of history for that object.
> 
> I think I get this part.  When you replace with a copy source, you're
> destroying the container that existed at this path, and replacing it with
> a new container that has history extending back through the copy source.  
> Is that correct?

Yup!

I was trying to think through the generalities here, too.  I believe they
boil down to this:

   "delete" stands alone.  It never has text.  Never has properties.
   Never has copyfrom.

   "add" and "replace" can have text if the added object is a file.  The
   text is the contents of the added object as it appears in the committed
   revision.  "add" and "replace" of directories cannot have text.

   "add" and "replace" can have properties -- the set of properties present
   on the added file/directory in the committed revision.

   "add" and "replace" can have copyfrom information, indicating that the
   "added" object does not truly represent the creation of a new line of
   history, but is instead a continuation of a pre-existing line of
   history.  This is still an addition of sorts in that the object is newly
   added to the set of its parent directory's list of children.

But I haven't double-thunk that for complete accuracy.
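
The rules above could be encoded as a small consistency check.  This is
only a sketch in Python under stated assumptions: check_node and its
boolean parameters are hypothetical names, not part of any Subversion
API, and the "change" action is left unconstrained because the summary
above does not cover it.

```python
def check_node(action, kind, has_text=False, has_props=False,
               has_copyfrom=False):
    """Return a list of rule violations for one node record (empty if none).

    `action` is "add", "delete", "replace" or "change"; `kind` is "file"
    or "dir".  All names here are illustrative assumptions.
    """
    problems = []
    if action == "delete":
        # "delete" stands alone: no text, no properties, no copyfrom.
        if has_text:
            problems.append("delete never has text")
        if has_props:
            problems.append("delete never has properties")
        if has_copyfrom:
            problems.append("delete never has copyfrom")
    elif action in ("add", "replace"):
        # Properties and copyfrom are always allowed; text only for files.
        if kind == "dir" and has_text:
            problems.append("add/replace of a directory cannot have text")
    return problems


# A replace with history *and* new text is legal for a file.
ok = check_node("replace", "file", has_text=True, has_copyfrom=True)
```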

> So, everything except a delete can include properties and they all
> work the same way. Correct?

Yes.

>>> If a file replace can have a copyfrom source, how does replace with a
>>> copyfrom source differ from add with a copyfrom source? 
>>
>> They differ only in the fact that a replace implies the simultaneous deletion
>> of some other object which previously lived at that path.
> 
> Got it.  That case I understand, it's how they differ in the non-copyfrom 
> case that still confuses me.

This is replace without copyfrom:

   $ svn rm some/file.txt
   $ touch some/file.txt
   $ svn add some/file.txt
   $ svn ci -m "Replace some/file.txt with a new file."

This is replace with copyfrom:

   $ svn rm some/file.txt
   $ touch some/file.txt
   $ svn copy someother/differentfile.txt some/file.txt
   $ svn ci -m "Replace some/file.txt with a copy of a different file."

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand

Re: Problems with the documentation of Subversion dump format

Posted by "Eric S. Raymond" <es...@thyrsus.com>.

C. Michael Pilato <cm...@collab.net>:
> > Does a file replace differ in any way from a delete plus add of the new text?
> 
> In Subversion, yes.  A replacement is, like an add or a delete, an operation
> at the node level, not an operation on the contents of that node.  A replace
> is an addition of a new object[1] -- with its own new line of version
> control history -- that is coincidental with the removal of some previously
> existing object that occupied the same path.

I still don't understand how this differs from a delete followed by an add.
Explain it to me like I'm reallllyyy stuuupid, please, so I can document it
and you never have to explain it again.

When I add a file at a given path, it creates a new object with a
history that is tracked.  When I delete that path, I destroy the 
container as well as the content.  If I subsequently create a new
file at the same path, it's a new object with its own history.

How is a replace different?

> [1] Most of the time.  A replacement can have a copyfrom source, in which
> case it's not strictly a new line of history for that object.

I think I get this part.  When you replace with a copy source, you're
destroying the container that existed at this path, and replacing it with
a new container that has history extending back through the copy source.  
Is that correct?

> > Can a replace include a property section?
> 
> Yes.

So, everything except a delete can include properties and they all
work the same way. Correct?
 
> > Does a replace always have text associated with it, or can it have a
> > copyfrom source?
> 
> You can have a replace with a copyfrom source (a "replace with history", as
> we call it).  You can even have a replace with a copyfrom source *and* text,
> such as would result from this on the client side:
> 
>   $ svn rm dir/file.txt
>   $ svn cp otherdir/otherfile.txt dir/file.txt
>   $ echo "Replacement text" > dir/file.txt
>   $ svn ci -m "Replace dir/file.txt with a copy of otherdir/otherfile.txt\
>   and replace its text, too."

I'll include this in the document I'm working up.

> > If a file replace can have a copyfrom source, how does replace with a
> > copyfrom source differ from add with a copyfrom source? 
> 
> They differ only in the fact that a replace implies the simultaneous deletion
> of some other object which previously lived at that path.

Got it.  That case I understand, it's how they differ in the non-copyfrom 
case that still confuses me.

> > How does a "change" differ from a "replace"?  My guess is that
> > "change" is issued for nodes that are pure property changes with no
> > file content changes; is this correct?
> 
> You are correct.

OK. 

I'm working up a second, more formal draft of the dumpfile description.
I'll post it here for review.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

Re: Problems with the documentation of Subversion dump format

Posted by "C. Michael Pilato" <cm...@collab.net>.

On 12/13/2011 09:44 AM, Eric S. Raymond wrote:
> Philip Martin <ph...@wandisco.com>:
>> esr@thyrsus.com (Eric S. Raymond) writes:
>>
>>> # The "replace" action [?is only issued with directory copies, and?]
>>> # signifies that the existing contents of the directory should be
>>> # removed before the copy.
>>
>> Replace applies to files as well.
> 
> Does a file replace differ in any way from a delete plus add of the new text?

In Subversion, yes.  A replacement is, like an add or a delete, an operation
at the node level, not an operation on the contents of that node.  A replace
is an addition of a new object[1] -- with its own new line of version
control history -- that is coincidental with the removal of some previously
existing object that occupied the same path.

[1] Most of the time.  A replacement can have a copyfrom source, in which
case it's not strictly a new line of history for that object.

> Can a replace include a property section?

Yes.

> Does a replace always have text associated with it, or can it have a
> copyfrom source?

You can have a replace with a copyfrom source (a "replace with history", as
we call it).  You can even have a replace with a copyfrom source *and* text,
such as would result from this on the client side:

  $ svn rm dir/file.txt
  $ svn cp otherdir/otherfile.txt dir/file.txt
  $ echo "Replacement text" > dir/file.txt
  $ svn ci -m "Replace dir/file.txt with a copy of otherdir/otherfile.txt\
  and replace its text, too."

> If a file replace can have a copyfrom source, how does replace with a
> copyfrom source differ from add with a copyfrom source? 

They differ only in the fact that a replace implies the simultaneous deletion
of some other object which previously lived at that path.

> How does a "change" differ from a "replace"?  My guess is that
> "change" is issued for nodes that are pure property changes with no
> file content changes; is this correct?

You are correct.

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand

Re: Problems with the documentation of Subversion dump format

Posted by "Eric S. Raymond" <es...@thyrsus.com>.

Philip Martin <ph...@wandisco.com>:
> esr@thyrsus.com (Eric S. Raymond) writes:
> 
> > # The "replace" action [?is only issued with directory copies, and?]
> > # signifies that the existing contents of the directory should be
> > # removed before the copy.
> 
> Replace applies to files as well.

Does a file replace differ in any way from a delete plus add of the new text?

Can a replace include a property section?

Does a replace always have text associated with it, or can it have a
copyfrom source?

If a file replace can have a copyfrom source, how does replace with a
copyfrom source differ from add with a copyfrom source? 

How does a "change" differ from a "replace"?  My guess is that
"change" is issued for nodes that are pure property changes with no
file content changes; is this correct?

For each of the actions, it needs to be clear

* whether it can have a text section
* whether it can have a copyfrom source
* whether it can have a property section
* whether it can be a file operation, a directory operation, or both.

The yes-or-no answers to those questions alone imply 36 different
possible semantic cases.  There are probably simplifying rules like
"anything that can have a text section can have a file copyfrom
source" or "any file operation except a delete can have a property
section", but the existing notes don't really make any of these clear.

To understand this, I'm reduced to groveling through megabytes of
dumpfiles, trying to figure out all the corner cases by eyeball.  This
means the documentation isn't good enough; the syntax of dumpfiles is
beautifully designed but the semantics is pretty murky.

> > # Interpreting copyfrom_path for file copies is straightforward; the
> > # target pathname gets the contents of the source pathname.
> > #
> > # Directory copies (the primitive beneath branching and tagging) are
> > # tricky.  For each source path under the source directory, a new path
> > # is generated by removing the head segment of the pathname that is
> > # the source directory.  That new path under the target directory gets
> > # the content of the source path.
> 
> Not sure what this means.  This copies A/B/C to X/Y/Z:
> 
> Node-path: X/Y/Z
> Node-kind: dir
> Node-action: add
> Node-copyfrom-rev: 10
> Node-copyfrom-path: A/B/C

I meant that, given this:

> Node-path: x/y/z
> Node-kind: dir
> Node-action: add
> Node-copyfrom-rev: 10
> Node-copyfrom-path: a/b/c

a file a/b/c/d will be copied to x/y/z/d.  The "a/b/c" is what I was calling
the "head segment".
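
In code, the mapping I have in mind looks roughly like the Python sketch
below.  The function name map_copied_path is made up for illustration,
and the sketch assumes that paths are slash-separated as they appear in
Node-path and Node-copyfrom-path headers.

```python
def map_copied_path(source_dir, target_dir, source_path):
    """Map a path under a copied directory to its new location.

    The leading `source_dir` portion (the "head segment") of `source_path`
    is replaced by `target_dir`.  Hypothetical helper, for illustration.
    """
    source_dir = source_dir.strip("/")
    target_dir = target_dir.strip("/")
    source_path = source_path.strip("/")
    if source_path == source_dir:
        return target_dir
    prefix = source_dir + "/"
    if not source_path.startswith(prefix):
        raise ValueError("%s is not under %s" % (source_path, source_dir))
    return target_dir + "/" + source_path[len(prefix):]


new_path = map_copied_path("a/b/c", "x/y/z", "a/b/c/d")
```

With the node above, new_path comes out as x/y/z/d.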

If that's not the way it's supposed to work, how is it supposed to work?

> > # A single revision may include multiple copyfrom nodes, even multiple
> > # copyfroms to the same directory, even mixed directory and file copies
> > # to the same directory; [?Subversion client tools never generate such
> > # mixed copies, but?] I have seen the results of cvs2svn doing it. 
> 
> Not sure what "mixed directory and file copies to the same directory"
> means or why Subversion clients would be restricted.  Given directories
> D1 and D2 and a file F it's trivial to copy D1 to D2/Dnew and F to
> D2/Fnew.

Yes, but in a single copy command?  My experience is that each copy
operation done from the CLI triggers its own commit.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

Re: Problems with the documentation of Subversion dump format

Posted by Philip Martin <ph...@wandisco.com>.

esr@thyrsus.com (Eric S. Raymond) writes:

> # The "replace" action [?is only issued with directory copies, and?]
> # signifies that the existing contents of the directory should be
> # removed before the copy.

Replace applies to files as well.

> # Interpreting copyfrom_path for file copies is straightforward; the
> # target pathname gets the contents of the source pathname.
> #
> # Directory copies (the primitive beneath branching and tagging) are
> # tricky.  For each source path under the source directory, a new path
> # is generated by removing the head segment of the pathname that is
> # the source directory.  That new path under the target directory gets
> # the content of the source path.

Not sure what this means.  This copies A/B/C to X/Y/Z:

Node-path: X/Y/Z
Node-kind: dir
Node-action: add
Node-copyfrom-rev: 10
Node-copyfrom-path: A/B/C
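
For readers who want to pull the copy source out of such a record, a
minimal Python sketch follows.  It assumes only what the example shows,
namely "Name: value" header lines terminated by a blank line; the
function name parse_node_headers is hypothetical, and this is nowhere
near a complete dumpfile parser.

```python
def parse_node_headers(text):
    """Parse the header lines of one node record into a dict.

    Reading stops at the first blank line, which separates the headers
    from any body.  Values are kept as strings.
    """
    headers = {}
    for line in text.splitlines():
        if not line.strip():
            break
        name, _, value = line.partition(": ")
        headers[name] = value
    return headers


record = (
    "Node-path: X/Y/Z\n"
    "Node-kind: dir\n"
    "Node-action: add\n"
    "Node-copyfrom-rev: 10\n"
    "Node-copyfrom-path: A/B/C\n"
    "\n"
)
headers = parse_node_headers(record)
copy_source = (headers["Node-copyfrom-path"], int(headers["Node-copyfrom-rev"]))
```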

> # A single revision may include multiple copyfrom nodes, even multiple
> # copyfroms to the same directory, even mixed directory and file copies
> # to the same directory; [?Subversion client tools never generate such
> # mixed copies, but?] I have seen the results of cvs2svn doing it. 

Not sure what "mixed directory and file copies to the same directory"
means or why Subversion clients would be restricted.  Given directories
D1 and D2 and a file F it's trivial to copy D1 to D2/Dnew and F to
D2/Fnew.

-- 
uberSVN: Apache Subversion Made Easy
http://www.uberSVN.com