Posted to dev@subversion.apache.org by Erik Huelsmann <eh...@gmail.com> on 2010/07/10 21:55:33 UTC

NODE_DATA (aka fourth tree)

As announced by gstein before, we've had some discussion on the
NODE_DATA structure which should allow storing multiple levels of tree
manipulation in our wc-db. This mail aims at describing my progress on
the subject so far. Please review and comment.


Introduction
----------------

What's the 4th tree about? The 4th tree is not a single tree; rather,
it's the ability to store overlapping tree changes in our WORKING
tree. Take the following tree:

root
 +- A - C - file
 \- B - C - file

Then, imagine replacing A with B. All would be fine with our current
single level WORKING representation. However, if we replace 'file' in
the copied tree, a single level won't do anymore: if you revert the
replacement of file, you want to revert to what was there when the
tree was copied. The other option - which you don't want because it
would result in an inconsistent tree - would be that wc-ng would
revert to what was there even before the copy operation.

To be able to revert the 'file' replacement independently of the 'A'
replacement, you need 2 levels of WORKING nodes for 'file': one for
the direct replacement and one for the replacement that comes with
replacing 'A'. By the same logic, many levels may be required to
model complicated working copy changes.
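
To make the revert behaviour concrete, here is a minimal sketch in
Python. It is an in-memory model only: the names (layers, push,
revert_top) are hypothetical and do not exist in libsvn_wc. Each path
keeps a stack of layers, and reverting removes only the topmost one.

```python
from collections import defaultdict

# Hypothetical model: for each path, a stack of layers.
# The bottom entry stands for BASE, the last entry is what is visible.
layers = defaultdict(list)

def push(path, content):
    """Record a new tree change affecting `path` on top of the others."""
    layers[path].append(content)

def revert_top(path):
    """Undo the most recent change to `path`; return what is visible now."""
    layers[path].pop()
    return layers[path][-1] if layers[path] else None

push("A/C/file", "BASE: original A/C/file")
push("A/C/file", "WORKING: copy of B/C/file (A replaced with B)")
push("A/C/file", "WORKING: replacement of file itself")

# Reverting the replacement of 'file' exposes the copied node,
# not the node that was there before the copy.
visible = revert_top("A/C/file")
print(visible)
```

With a single WORKING level there would be nothing between the
replacement and BASE, so the same revert would fall through to the
pre-copy node.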


What this change is not
----------------------------------

This change does not alter the current behaviour of libsvn_wc, in
which modifying an already modified tree is a destructive operation.
The multi-level model exists only to keep track of WORKING tree
changes, not to make changes to the ACTUAL tree visible again after
reverting a replaced subtree.



Proposed change
-------------------------

Greg made a proposal on the list some time ago which allows the
required multiplicity of WORKING nodes by creating a new table:
NODE_DATA. The table was proposed to hold a subset of the columns
currently in the BASE_NODE and WORKING_NODE tables.

The rationale for storing the BASE_NODE data in the table too is
that a query for a node which doesn't have a WORKING version will
simply return the BASE version. That way, there's no need to teach the
code about the absence of WORKING. Although the BASE_NODE information
is put in this table, this doesn't mean the BASE_NODE and WORKING_NODE
concepts are being redefined, other than allowing layered WORKING_NODE
(sub)trees.


Columns to be placed in NODE_DATA:

 * wc_id
 * local_relpath
 * oproot_distance
 * presence
 * kind
 * revnum
 * checksum
 * translated_size
 * last_mod_time
 * changed_rev
 * changed_date
 * changed_author
 * depth
 * properties
 * dav_cache
 * symlink_target
 * file_external

This means these columns stay in WORKING_NODE (next to its key, of course):

 * copyfrom_repos_id
 * copyfrom_repos_path
 * copyfrom_revnum
 * moved_here
 * moved_to

These columns can stay in WORKING_NODE, because all children inherit
their values from the oproot. I.e. a subdirectory of a copied
directory inherits the copy/move info, unless it's been copied/moved
itself, in which case it has its own copy information.
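
A rough sketch of that inheritance, in Python. Everything here is an
assumption for illustration: the dict standing in for WORKING_NODE,
the helper names, the sample revision, and in particular the idea of
deriving a child's copy source by appending its path below the oproot
to the oproot's copyfrom path.

```python
# Hypothetical stand-in for WORKING_NODE: only op roots carry copy info.
working_node = {
    "A": {"copyfrom_repos_path": "B", "copyfrom_revnum": 10},
}

def oproot_of(local_relpath, oproot_distance):
    """Strip `oproot_distance` trailing components to find the op root."""
    parts = local_relpath.split("/")
    return "/".join(parts[:len(parts) - oproot_distance])

def copy_info(local_relpath, oproot_distance):
    """Look up the copy info a node inherits from its op root."""
    root = oproot_of(local_relpath, oproot_distance)
    info = working_node[root]
    # Assumption: the child's source is the op root's source plus the
    # child's path relative to the op root.
    suffix = local_relpath[len(root):]
    return info["copyfrom_repos_path"] + suffix, info["copyfrom_revnum"]

child = copy_info("A/C/file", 2)   # op root is 'A', two levels up
root = copy_info("A", 0)           # the op root itself
print(child, root)
```

A child that has been copied or moved itself would simply be its own
oproot (distance 0) with its own entry.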


As described before, if you sort the rows relating to a certain path
in ascending order of their distance to their oproot, you'd always get
the 'current' WORKING state applicable to the node, provided that the
distance between the node and the working copy root is used to
identify the BASE_NODE data.
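
The ordering rule can be sketched with Python's sqlite3 module. This
is not the proposed schema: the table below uses only a handful of
the listed columns, the key choice and the sample rows are
assumptions, and current_layer is a hypothetical helper rather than a
wc-db API function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE node_data (
        wc_id           INTEGER NOT NULL,
        local_relpath   TEXT NOT NULL,
        oproot_distance INTEGER NOT NULL,
        presence        TEXT NOT NULL,
        kind            TEXT NOT NULL,
        PRIMARY KEY (wc_id, local_relpath, oproot_distance)
    )
""")

rows = [
    (1, "A/C/file", 3, "normal", "file"),  # BASE: distance to the wc root
    (1, "A/C/file", 2, "normal", "file"),  # WORKING: op rooted at 'A'
    (1, "A/C/file", 0, "normal", "file"),  # WORKING: op rooted at 'file'
    (1, "B/C/file", 3, "normal", "file"),  # BASE only, no WORKING layer
]
conn.executemany("INSERT INTO node_data VALUES (?, ?, ?, ?, ?)", rows)

def current_layer(wc_id, local_relpath):
    """Return the oproot_distance of the row that is currently visible."""
    row = conn.execute(
        "SELECT oproot_distance FROM node_data"
        " WHERE wc_id = ? AND local_relpath = ?"
        " ORDER BY oproot_distance ASC LIMIT 1",
        (wc_id, local_relpath),
    ).fetchone()
    return row[0] if row else None

print(current_layer(1, "A/C/file"))  # the direct replacement
print(current_layer(1, "B/C/file"))  # falls through to the BASE row

# Reverting the replacement of 'file' deletes only its own layer.
conn.execute(
    "DELETE FROM node_data WHERE wc_id = 1"
    " AND local_relpath = 'A/C/file' AND oproot_distance = 0"
)
print(current_layer(1, "A/C/file"))  # now the layer rooted at 'A'
```

Note that the same query serves nodes with and without WORKING rows,
which is the point of keeping the BASE data in the same table.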


Most, if not all, of the changes to the underlying table structure
should stay hidden behind the wc-db API.



Relevance to 1.7
----------------------

Why do we need this change now? Why can't it wait until we've finished
1.7? After all, it's just polishing the way we versioned directories
in wc-1, right?

Not exactly. Currently, mixed-revision working copies are modelled
using an oproot for each subtree with its own revision number. That
means that without this change, effectively we can't represent
mixed-revision working copy trees. So, in order to achieve feature
parity with 1.6, we need to realise this change before 1.7.



Well, that's basically it. Comments?


Bye,


Erik.

Re: NODE_DATA (aka fourth tree)

Posted by Erik Huelsmann <eh...@gmail.com>.
>>  * moved_here
>>  * moved_to

On IRC, we were discussing the fact that these columns are in the
databases, but nobody seems to be planning to implement them for 1.7.
Is that your perception too? If so, we could remove them with the
upcoming schema-change required for NODE_DATA.

Bye,

Erik.

Re: NODE_DATA (aka fourth tree)

Posted by Greg Stein <gs...@gmail.com>.
On Mon, Jul 12, 2010 at 06:32, Erik Huelsmann <eh...@gmail.com> wrote:
> On Sun, Jul 11, 2010 at 1:04 AM, Greg Stein <gs...@gmail.com> wrote:
>...
>>>  * translated_size
>>>  * last_mod_time
>
> Thinking about it a bit more, I think translated_size and
> last_mod_time are a bit odd to have in NODE_DATA - although they are
> part of both BASE_NODE and WORKING_NODE: they really do apply only to
> BASE and the *current* working node: they are part of the optimization
> to determine if a file has changed. Presumably, when a different layer
> of WORKING becomes visible, we'll be recalculating both fields.
>
>
> If that's the case, shouldn't we just hold onto them in their respective tables?

Fair enough.

>...
>>>  * symlink_target
>>>  * file_external
>>
>> I'm not sure that file_external belongs here. We certainly don't have
>> it in WORKING_NODE.
>
> I've been informing around on IRC to understand the difference between
> why that would apply to file_external, but not to symlink_target. The
> difference isn't clear to me yet. Do you have anything which might
> help me?

To be honest, file external state is kind of a hand-wave. It is
possible they need storage in WORKING_NODE (well... NODE_DATA), too.
The column was added to BASE_NODE to kind of get things working.

The proper solution is to review the file externals implementation
across WC and figure out what is needed. None of the currently active
committers has a good handle on file externals.

>...

Cheers,
-g

Re: NODE_DATA (aka fourth tree)

Posted by Erik Huelsmann <eh...@gmail.com>.
On Sun, Jul 11, 2010 at 1:04 AM, Greg Stein <gs...@gmail.com> wrote:
> On Sat, Jul 10, 2010 at 17:55, Erik Huelsmann <eh...@gmail.com> wrote:
>>...
>> Columns to be placed in NODE_DATA:
>>
>>  * wc_id
>>  * local_relpath
>>  * oproot_distance
>>  * presence
>>  * kind
>>  * revnum
>
> revnum is a BASE concept, so it does not belong here. WORKING nodes do
> not have a revision until they are committed. If the node is copied
> from the repository, then the *source* of that copy needs a revision
> and path, but that is conceptually different from "revnum" (which
> identifies the rev of the node itself).
>
>>  * checksum
>>  * translated_size
>>  * last_mod_time

Thinking about it a bit more, I think translated_size and
last_mod_time are a bit odd to have in NODE_DATA - although they are
part of both BASE_NODE and WORKING_NODE: they really do apply only to
BASE and the *current* working node: they are part of the optimization
to determine if a file has changed. Presumably, when a different layer
of WORKING becomes visible, we'll be recalculating both fields.


If that's the case, shouldn't we just hold onto them in their respective tables?


>>  * changed_rev
>>  * changed_date
>>  * changed_author
>>  * depth
>>  * properties
>>  * dav_cache
>
> dav_cache is also a BASE concept, and remains in BASE_NODE.

Agreed.

>>  * symlink_target
>>  * file_external
>
> I'm not sure that file_external belongs here. We certainly don't have
> it in WORKING_NODE.

I've been asking around on IRC to understand why that would apply to
file_external, but not to symlink_target. The difference isn't clear
to me yet. Do you have anything which might help me?

>> This means, these columns stay in WORKING_NODE (next to its key, ofcourse):
>>
>>  * copyfrom_repos_id
>>  * copyfrom_repos_path
>>  * copyfrom_revnum
>>  * moved_here
>>  * moved_to
>>
>> These columns can stay in WORKING_NODE, because all children inherit
>> their values from the oproot. I.e. a subdirectory of a copied
>> directory inherits the copy/move info, unless it's been copied/moved
>> itself, in which case it has its own copy information.
>
> Right.
>
> Also note that we can opportunistically rename the above columns to
> their wc_db API names: original_*. They would be original_repos_id,
> original_repos_relpath, original_revision.

Done. (In my local patch-in-preparation.)


Bye,


Erik.

Re: NODE_DATA (aka fourth tree)

Posted by Greg Stein <gs...@gmail.com>.
On Sat, Jul 10, 2010 at 17:55, Erik Huelsmann <eh...@gmail.com> wrote:
>...
> Columns to be placed in NODE_DATA:
>
>  * wc_id
>  * local_relpath
>  * oproot_distance
>  * presence
>  * kind
>  * revnum

revnum is a BASE concept, so it does not belong here. WORKING nodes do
not have a revision until they are committed. If the node is copied
from the repository, then the *source* of that copy needs a revision
and path, but that is conceptually different from "revnum" (which
identifies the rev of the node itself).

>  * checksum
>  * translated_size
>  * last_mod_time
>  * changed_rev
>  * changed_date
>  * changed_author
>  * depth
>  * properties
>  * dav_cache

dav_cache is also a BASE concept, and remains in BASE_NODE.

>  * symlink_target
>  * file_external

I'm not sure that file_external belongs here. We certainly don't have
it in WORKING_NODE.

> This means, these columns stay in WORKING_NODE (next to its key, ofcourse):
>
>  * copyfrom_repos_id
>  * copyfrom_repos_path
>  * copyfrom_revnum
>  * moved_here
>  * moved_to
>
> These columns can stay in WORKING_NODE, because all children inherit
> their values from the oproot. I.e. a subdirectory of a copied
> directory inherits the copy/move info, unless it's been copied/moved
> itself, in which case it has its own copy information.

Right.

Also note that we can opportunistically rename the above columns to
their wc_db API names: original_*. They would be original_repos_id,
original_repos_relpath, original_revision.

(we can also rename BASE_NODE.revnum to BASE_NODE.revision)

>...
> Relevance to 1.7
> ----------------------
>
> Why do we need this change now? Why can't it wait until we finished
> 1.7, after all, it's just polishing the way we versioned directories
> in wc-1, right?
>
> Not exactly. Currently, mixed-revision working copies are modelled
> using an oproot for each subtree with its own revision number. That
> means that without this change, effectively we can't represent
> mixed-revision working copy trees. So, in order to achieve feature
> parity with 1.6, we need to realise this change before 1.7.

Right now, we support copying a mixed-revision tree. During the copy,
we synthesize new oproots at each point where revision !=
parent_revision.
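
As an illustration only (this is not the libsvn_wc code; the function
name and the sample revisions are made up), the rule "new oproot
wherever revision != parent_revision" could be sketched in Python as:

```python
# Hypothetical input: local_relpath -> revision of a mixed-revision tree.
revisions = {
    "A": 10,
    "A/C": 10,
    "A/C/file": 12,
    "A/D": 11,
    "A/D/x": 11,
}

def synthesized_oproots(revisions, copy_root):
    """Return the copy root plus every node whose revision differs
    from its parent's revision."""
    roots = [copy_root]
    for path in sorted(revisions):
        if path == copy_root or not path.startswith(copy_root + "/"):
            continue
        parent = path.rsplit("/", 1)[0]
        if revisions[path] != revisions[parent]:
            roots.append(path)
    return roots

oproots = synthesized_oproots(revisions, "A")
print(oproots)
```

Here 'A/D/x' gets no oproot of its own because it shares its parent's
revision.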

The real problem comes in with a local-add underneath a
copy/move-here. Adds do not have a marker to state "I'm not part of
the ancestor operation." We also have problems reverting child
operations, such as replacing a child of a copy/move-here. 1.6 is
apparently able to revert the replacement, returning the node to a
copied child. (maybe; there is *some* revert operation that 1.6 can do
that our current schema cannot accomplish; Philip may recall it)

So. In order to support that 1.6 revert scenario, and to better
support future revert capabilities, the NODE_DATA concept is needed.
It also solves the local-add under a copy problem.

Cheers,
-g