You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@subversion.apache.org by Julian Foad <ju...@btopenworld.com> on 2013/08/15 17:01:55 UTC

Semantics of Move

I propose the following logical semantics of the versioned move
operation that is the basis of move tracking, independent of any
implementation.

A versioned move of the node with node id “N”, with respect to two
revisions rX and rY (X < Y), shall mean:

    * Same node id. A node with node id N exists in rX and in rY. It
is “the same node”. It therefore has the same node kind. It may have
content modifications.

    * No gap. There cannot be a gap in the range of revisions: node id
N exists in every revision rX, rX+1, …, rY-1, rY.  (Contrast with
copy-and-delete, where there can be a gap between the delete and the
copy.)  (The possible “resurrection” extension to these semantics
would permit a gap.)

    * Move and/or rename. Node N has either or both of: a different
name (base name) in rX than in rY; and/or a different parent (parent
directory node id) in rX than in rY.

    * Children follow. If N is a directory, each child (recursively)
of N in rX remains a child of N in rY, with the same name, unless it
is separately moved or deleted. Any or all of the children can be
separately moved within or outside the subtree at N, at the same time
as N is moved.

    * No null move. An attempted move which does not change the node's
name or its parent node, with or without a modification, is not
distinguished from a normal succession of history.

=== Move versus Rename ===

In the versioned data model semantics, “move” refers to a change of
parent node and/or a change of name.

At a higher level of semantics, for example when resolving conflicts
during merge, it can be useful to distinguish between renaming and
moving to a different parent node.

=== Can't Move a Child of a Copy ===

Moving a child of a copy, within the same revision, is not tracked: it
is an unversioned operation.

A versioned "move" takes a node that existed in the previous revision
and places it in a new location.  A copy, however, always creates new
nodes, conceptually, even if the internal representation is a "lazy
copy" pointer to the old node.  Moving a child node therefore is a
rearrangement of the new content. It is semantically the same as
deleting the child node and creating a copy of it somewhere else.
Compare with copying a node and then moving that copy somewhere else.

If we perform a copy and then move a child of it, either in a WC or in
a repository, this should create a copy with a deleted child, and then
another copy somewhere else which is the "moved" child in its new
path.

Is this acceptable?  This means that combining the two separately
committed changes "copy" and "move a child" into a single commit will
result in semantic data loss, which we are trying to avoid.


Thoughts?

- Julian

Re: Semantics of Move

Posted by Julian Foad <ju...@btopenworld.com>.

Branko Čibej wrote:
> On 15.08.2013 17:01, Julian Foad wrote:
>>  Is this acceptable?
> 
> Looks OK, from a quick review.
> 
>> This means that combining the two separately committed
>> changes "copy" and "move a child" into a single commit will
>> result in semantic data loss, which we are trying to avoid.
> 
> Not really. The copied+moved child node will still have copyfrom
> information linking it to the source of the copy. That doesn't trivially
> tell us how the node came to be at the new location, but there are other
> cases where that's not tracked (e.g., sequential local renames followed
> by a commit; only the result will be tracked, not the intermediate
> (wc-local) states).

Ah, yes: the loss of information about intermediate states does not matter, because we're talking about squashing a series of revisions into a single revision, so by definition information about the intermediate states is to be lost.  All that matters is the relationship between the final state and the initial state, and the (user-level) meaning of that relationship can be described perfectly well without saying "move", in the form "the subtree at path P@X is copied to path P1@Y, except for its child C which is copied to path P2/C@Y instead".  That makes me happy.

- Julian

Re: Semantics of Move

Posted by Branko Čibej <br...@wandisco.com>.

On 15.08.2013 17:01, Julian Foad wrote:
> Is this acceptable?

Looks OK, from a quick review.

> This means that combining the two separately
> committed changes "copy" and "move a child" into a single commit will
> result in semantic data loss, which we are trying to avoid.

Not really. The copied+moved child node will still have copyfrom
information linking it to the source of the copy. That doesn't trivially
tell us how the node came to be at the new location, but there are other
cases where that's not tracked (e.g., sequential local renames followed
by a commit; only the result will be tracked, not the intermediate
(wc-local) states).

-- Brane

-- 
Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
e. brane@wandisco.com

Re: Semantics of Move

Posted by Julian Foad <ju...@btopenworld.com>.

On 2013-08-16, Blair Zajac wrote:
> Julian Foad wrote:
>>>> On 08/15/2013 08:01 AM, Julian Foad wrote:
>>>>> I propose the following logical semantics of the versioned move
>>>>> operation that is the basis of move tracking, independent of any
>>>>> implementation.
>>>>> 
>>>>> A versioned move of the node with node id "N", with respect to two
>>>>> revisions rX and rY (X < Y), shall mean:
>>>>> 
>>>>>   * Same node id. A node with node id N exists in rX and in rY. It
>>>>>   is "the same node". It therefore has the same node kind. It may have
>>>>>   content modifications.
>> 
>> The key that I am referring to in the description of moves is (node-id,
>> copy-id).  Perhaps I will call it "node-copy-id" for short.  I just
>> checked and that term currently does not appear in the source tree.  I'll
>> update my docs.  Thanks for pointing out the confusion

In discussion yesterday I realized that the (node-id, copy-id) pair from the filesystem is still not what I want to refer to in this definition of semantics, because of lazy copies.  When we copy a directory, we don't assign a new copy-id to each child right away, but only if and when the child is modified.  I want to define the move semantics in terms of some (conceptual) id that is always new for each copied path, including all the unmodified children of a copied directory.

Therefore I've reworked the definition and also expanded and rewritten some other parts of it.  The result is at <http://wiki.apache.org/subversion/MoveDev/MoveDev#Move_Semantics> and repeated in plain text below.

----------

This section specifies the logical semantics of the versioned move operation that is the basis of move tracking, independent of any implementation.

== Node-Line-Id ==

Assume that each versioned path in a revision has an identifier that we will call its node-line-id.  The node-line-id need not physically exist: it is a concept used in the definition of moves but not necessarily in the implementation.

The node-line-id is preserved when the content of a node is modified.  A new node-line-id is assigned to every new node that is created by addition or by copying, including when it replaces a previous node at the same path.  This new node-line-id is unique within the whole repository.  Within any given revision, each node-line-id is unique among all the paths.

The node-line-id is similar to the (node-id, copy-id) tuple in the existing Subversion filesystems, except that the lazy-copy mechanism does not assign a new copy-id to a child of a copy until that child (or one of its descendants) is modified.  Therefore an unmodified child of a copy has the same (node-id, copy-id) as the corresponding child path of the copy source, whereas (by definition) it has a new node-line-id.

Let the term node-line refer to the set of PATH@REV locations that have a given node-line-id.

== Definition of Move ==

Given a node-line N and two revisions rX and rY (X < Y), the definition of N being moved in rY with respect to rX is:

    * Same node-line.  A node-line with node-line-id N exists at path PN,X in rX and at path PN,Y in rY.  It is the same node-line, and so has the same node kind.  Its content may differ.

    * Move and/or rename.  The node-line N has either or both of: a different name (base name) in rX than in rY; and/or a different parent (parent directory node-line-id) in rX than in rY.  Thus, the paths PN,X and PN,Y typically differ but may be the same.

    * No gap.  There cannot be a gap in the range of revisions: node-line-id N exists in every revision rX, rX+1, ..., rY-1, rY.  Contrast with copy-and-delete, where there can be a gap between the delete and the copy.  The possible "resurrection" extension to the move semantics would permit a gap.

    * Children follow.  If N is a directory, each child (recursively) of N in rX remains a child of N in rY, with the same name, unless it is separately moved or deleted.  Any or all of the children can be separately moved within or outside the subtree at N, at the same time as N is moved.

== Properties of Move ==

Some properties of the move relationship are:

    * Unique. move(A@X,B@Y) and move(A@X,C@Y) cannot both be true.

    * Transitive. move(A@X,B@Y) followed by move(B@Y,C@Z) collapses to move(A@X,C@Z).

    * Reversible. move(A@X,B@Y) followed by move(B@Y,A@Z) collapses to no-move.

    * Time-symmetric. move(A@X,B@Y) is symmetric with the time-reversed relationship  move(B@Y,A@X).

    * No null move. An attempted move which does not change the node's name or its parent node, with or without a modification, is not distinguished from a normal succession of history.

In the notation A@X, A represents a parent directory node-line-id and child name (rather than a full path), X represents a revision number, and A != B, and X != Y, etc.

== Move versus Rename ==

In the versioned data model semantics, "move" refers to a change of parent node and/or a change of name.

At a higher level of semantics, for example when resolving conflicts during merge, it can be useful to distinguish between renaming and moving to a different parent node.

== Can't Move a Child of a Copy ==

Moving a child of a copy, within the same revision, is not tracked: it is an unversioned operation.

A versioned "move" takes a node that existed in the previous revision and places it in a new location.  A copy, however, always creates new nodes, conceptually, even if the internal representation is a "lazy copy" pointer to the old node.  Moving a child node therefore is a rearrangement of the new content.  It is semantically the same as deleting the child node and creating a copy of it somewhere else.  Compare with copying a node and then moving that copy somewhere else.

If we perform a copy and then move a child of it, either in a WC or in a repository, this should create a copy with a deleted child, and then another copy somewhere else which is the "moved" child in its new location.  We can describe the relationship between the initial and final states perfectly well without saying "move", in the form "the subtree at path P@X is copied to path P1@Y, except for its child C which is copied to path P2/C@Y instead".  Thus there is no loss of semantic information despite the absence of "move" in the result.

== Resurrecting a Deleted Node ==

A possible extension to the move semantics would be to allow a previously deleted node to be resurrected at the same location or at a different location.

The mental model is this.  When a node is deleted, it is unlinked from the versioned tree, but its content continues to exist in the repository.  When the node is resurrected, a link to that last version of its content is put into the versioned tree, like undoing the delete.  The new link can be made anywhere in the versioned tree, like undoing the delete and moving the node at the same time.

This is not merely a blue-sky curiosity: it may necessary in order to ensure logical completeness.  For example, let's say the node initially at path A/foo is moved to B/foo and then back to A/foo.  If we create a branch C from A, and are continuously merging all changes under A into C, then C/foo will be deleted and will later be recreated.  Or if 'svnsync' is replicating only subtree 'A' of repository R1 into repository R2, then in repository R2 we will see A/foo disappear and then reappear.  In cases such as these, the 'reappearance' should be modeled as a resurrection; if it is modeled as a plain add or as a copy, it will not have the correct semantics of being the 'same' node that it was before. 

------------------------------

- Julian

Re: Semantics of Move

Posted by Blair Zajac <bl...@orcaware.com>.

On 08/16/2013 04:14 AM, Julian Foad wrote:
>
> Branko Čibej wrote:
>
>> On 16.08.2013 04:48, Blair Zajac wrote:
>>>   On 08/15/2013 08:01 AM, Julian Foad wrote:
>>>>   I propose the following logical semantics of the versioned move
>>>>   operation that is the basis of move tracking, independent of any
>>>>   implementation.
>>>>
>>>>   A versioned move of the node with node id “N”, with respect to two
>>>>   revisions rX and rY (X < Y), shall mean:
>>>>
>>>>        * Same node id. A node with node id N exists in rX and in rY. It
>>>>   is “the same node”. It therefore has the same node kind. It may have
>>>>   content modifications.
>>>
>>>   If the contents change then it should get a new node-id.
> [...]
>>
>> You apparently misunderstand what a node-id is. It is not the primary
>> key to a revision of the node. [...] the primary key is:
>>
>>      (node-id, copy-id, txn-id)
>
> To be fair, I was sloppy in my writing, and we have historically been sloppy with the term node-id.  For example, the method root_vtable_t.node_id() returns a svn_fs_id_t which is a node-revision id (node-id, copy-id, txn-id), not a node-id.
>
> The key that I am referring to in the description of moves is (node-id, copy-id).  Perhaps I will call it "node-copy-id" for short.  I just checked and that term currently does not appear in the source tree.  I'll update my docs.  Thanks for pointing out the confusion

Thanks guys for clarifying, that makes me feel much better.

Blair

Re: Semantics of Move

Posted by Julian Foad <ju...@btopenworld.com>.

Branko Čibej wrote:

> On 16.08.2013 04:48, Blair Zajac wrote:
>>  On 08/15/2013 08:01 AM, Julian Foad wrote:
>>>  I propose the following logical semantics of the versioned move
>>>  operation that is the basis of move tracking, independent of any
>>>  implementation.
>>> 
>>>  A versioned move of the node with node id “N”, with respect to two
>>>  revisions rX and rY (X < Y), shall mean:
>>> 
>>>       * Same node id. A node with node id N exists in rX and in rY. It
>>>  is “the same node”. It therefore has the same node kind. It may have
>>>  content modifications.
>> 
>>  If the contents change then it should get a new node-id.
[...]
> 
> You apparently misunderstand what a node-id is. It is not the primary
> key to a revision of the node. [...] the primary key is:
> 
>     (node-id, copy-id, txn-id)

To be fair, I was sloppy in my writing, and we have historically been sloppy with the term node-id.  For example, the method root_vtable_t.node_id() returns a svn_fs_id_t which is a node-revision id (node-id, copy-id, txn-id), not a node-id.

The key that I am referring to in the description of moves is (node-id, copy-id).  Perhaps I will call it "node-copy-id" for short.  I just checked and that term currently does not appear in the source tree.  I'll update my docs.  Thanks for pointing out the confusion.

- Julian

Re: Semantics of Move

Posted by Branko Čibej <br...@wandisco.com>.

On 16.08.2013 04:48, Blair Zajac wrote:
> On 08/15/2013 08:01 AM, Julian Foad wrote:
>> I propose the following logical semantics of the versioned move
>> operation that is the basis of move tracking, independent of any
>> implementation.
>>
>> A versioned move of the node with node id “N”, with respect to two
>> revisions rX and rY (X < Y), shall mean:
>>
>>      * Same node id. A node with node id N exists in rX and in rY. It
>> is “the same node”. It therefore has the same node kind. It may have
>> content modifications.
>
> If the contents change then it should get a new node-id.
>
> I have a vested stake in this since our asset management system backed
> by svn caches in memcached the result of many svn_fs_*() functions by
> node-id.  Our system does three cache lookups:
>
> time -> revision
> (revision, path) -> node_id
> (node_id, function_name) -> result
>
> Doing this allows me to take a lot of load off the svn backend RPC server.

You apparently misunderstand what a node-id is. It is not the primary
key to a revision of the node. In our current numbering scheme (and in
any future planned such scheme that I can currently think of), the
primary key is:

    (node-id, copy-id, txn-id)

(you can think of the last element as the revision id, but the primary
key has to remain unique in the presence of uncommitted transactions).

Given a node identified by (node-id, copy-id, txn-id), and ignoring the
fact that our current copy-id inheritance cannot support moves in
general, copy and move differ in exactly one particular:

  * copy creates (node-id, new-copy-id, new-txn-id)
  * move creates (node-id, copy-id, new-txn-id)

In other words, both copy and move result in a new primary key for a
node revision, but copy creates a new branch whereas move does not.

I expect that your system uses the primary node-revision key, not the
node-id.

-- Brane

-- 
Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
e. brane@wandisco.com

Re: Semantics of Move

Posted by Blair Zajac <bl...@orcaware.com>.

On 08/15/2013 08:01 AM, Julian Foad wrote:
> I propose the following logical semantics of the versioned move
> operation that is the basis of move tracking, independent of any
> implementation.
>
> A versioned move of the node with node id “N”, with respect to two
> revisions rX and rY (X < Y), shall mean:
>
>      * Same node id. A node with node id N exists in rX and in rY. It
> is “the same node”. It therefore has the same node kind. It may have
> content modifications.

If the contents change then it should get a new node-id.

I have a vested stake in this since our asset management system backed 
by svn caches in memcached the result of many svn_fs_*() functions by 
node-id.  Our system does three cache lookups:

time -> revision
(revision, path) -> node_id
(node_id, function_name) -> result

Doing this allows me to take a lot of load off the svn backend RPC server.

Blair