You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@subversion.apache.org by Greg Stein <gs...@gmail.com> on 2010/09/20 20:40:18 UTC

add NODES.prior_deleted boolean column

After working through the several email messages, and discussion, I
believe we're now down to a simple change:

* add a "prior_deleted" flag to NODES

The flag simply means that a node exists prior to this layer and has
been deleted or moved-away. The 'presence' column may say the same
thing, but it might also describe data that is replacing the
deletion/move.

When a deletion (of a subtree) occurs, then we create a new layer at
<relpath, op_depth>. New rows are written for the root, and all
children, using that op_depth value. If this is a moved-away, then we
store the destination into moved_to at the root *only* (which can then
later discriminate between the two types of deletions; children need
to look to the root to discriminate; I bet this need is rare). Note
that the deletion process needs to look for mods to descendents:
deletes are integrated into this one; other operations may error with
"can't delete local mods" or somesuch.

For the following actions, these are applied to the root of a deletion:

If an add occurs, then the root is updated to set presence='added'. No
other changes are needed.

If a copied-here or moved-here occurs, then the root is updated with
the appropriate status and source information. Child nodes *may* have
their presence switched from 'deleted' to 'copied-here' or
'moved-here' (depends on whether the arriving nodes intersect with the
old namespace). New nodes may be introduced, with presence=$whatever
and prior_deleted=0 (FALSE)

If a deletion of a child (subtree) of copied-here or moved-here
occurs, then it has a new op_depth and defines a whole new layer. The
"prior_deleted" is set to 1 (TRUE) indicating the prior layer (which
happens to be the copy/move-here) has been deleted.

Deletion of an add is effectively a revert. If this is a child, then
the layer is simply removed (it only has one node). If the
deletion/revert of an add has prior_deleted=1 (meaning it is a root),
then the node is rewritten to presence='deleted', restoring it to the
state when the deletion first occurred. (and yes, a second revert
undoes the deletion, etc...).

Reverting a child of a moved/copied-here tree is invalid. When you
revert the root, then the children at this op_depth are traversed: any
nodes with prior_deleted=1 are restored to presence=deleted, and nodes
with prior_deleted=0 (newly-arrived from the copy/move) are simply
removed.

Note that prior_deleted is set to TRUE only for a deletion operation
(when presence is set to 'deleted'). That implies a prior node
existed. For the sequence [rm A/B, add A/B, add A/B/foo], the node
A/B/foo will have op_depth=3 and prior_deleted=0 since the row was
created by an add. Assuming that A/B/foo existed originally, then
prior_deleted=1 at <A/B/foo, op_depth=2>.


I think that is it. Summarized a bit better from the earlier thread.

Cheers,
-g

Re: add NODES.prior_deleted boolean column

Posted by Julian Foad <ju...@wandisco.com>.

Pish, only the OpenOffice attachment was preserved.  Here's a PDF copy,
temporarily: <http://filebin.ca/fobqtm/nodes-states-2.pdf>.

- Julian


On Tue, 2010-09-21 at 18:41 +0100, Julian Foad wrote:
> Greg Stein wrote:
> > After working through the several email messages, and discussion, I
> > believe we're now down to a simple change:
> > 
> > * add a "prior_deleted" flag to NODES
> > 
> > The flag simply means that a node exists prior to this layer and has
> > been deleted or moved-away. The 'presence' column may say the same
> > thing, but it might also describe data that is replacing the
> > deletion/move.
> 
> Do you see this working in conjunction with the current set of presence
> values, or your proposed extended set?
> 
> That flag would just mean "There is a row for the same path with a
> smaller op_depth and with a non-negative kind of presence", right?  So
> whether we actually store that flag is a matter of impl/efficiency, not
> of logical design.  Have I understood?
> 
> The subject that this arose from was how to store all the possible
> states of a working row.  First I want to know what are all the possible
> states of a working row that we need to represent, before we decide how
> to represent them.  I don't think we have ever written them down yet, in
> full detail, so I have tried.
> 
> Please see the two tables in the "nodes-states" document that I am
> attaching as .ods and as .pdf, and as two .png images.  I'm not sure
> whether any of the attachments will get through to the list.
> 
> Table 1
> 
> The first table enumerates all the states of a row in NODES,ignoring any
> "prior-deleted" or "moved-away" part of the state if the node has also
> been replaced.  It shows whether each such state can exist in BASE
> (op_depth = 0) and/or in WORKING (op_depth > 0) rows.
> 
> The remaining columns are works in progress.  The "Can be excluded?"
> column starts to address the question "Can we copy or move a tree that
> contains an excluded node?"
> 
> Table 2
> 
> The second table starts to define the state that results from applying
> any possible structural change to a node.
> 
> 
> 
> I assume this is in conjunction with the current set of presence values,
> not your proposed extended set.  So the possible changes would be
> encoded as:
> 
> 
> 
> > When a deletion (of a subtree) occurs, then we create a new layer at
> > <relpath, op_depth>. New rows are written for the root, and all
> > children, using that op_depth value. If this is a moved-away, then we
> > store the destination into moved_to at the root *only* (which can then
> > later discriminate between the two types of deletions; children need
> > to look to the root to discriminate; I bet this need is rare). Note
> > that the deletion process needs to look for mods to descendents:
> > deletes are integrated into this one; other operations may error with
> > "can't delete local mods" or somesuch.
> > 
> > For the following actions, these are applied to the root of a deletion:
> 
> What do you mean "these are applied to the root of a deletion"?  I guess
> "add", "copy-here", "move-here" can only be applied to the root of a
> deletion or to an unversioned/not-present path; is that it?
> 
> > If an add occurs, then the root is updated to set presence='added'. No
> > other changes are needed.
> 
> Apart from setting the new node kind.  And apart from changing the
> op_depth of all its still-deleted children to obey the deletion-op-depth
> rule:
> 
>   checkout: (A/B, A/B/C, A/B/gamma),op_d=0,normal
> 
>   delete A/B:  add rows (A/B, A/B/C, A/B/gamma),op_d=2,deleted
> 
>   add new file B:  modify row (A/B),op_d=2:
>                        presence/status := deleted+added
>                        kind := file
>                    modify rows (A/B/C, A/B/gamma),op_d=2:
>                        op_d := 3
> 
> - Julian
> 
> 
> > If a copied-here or moved-here occurs, then the root is updated with
> > the appropriate status and source information. Child nodes *may* have
> > their presence switched from 'deleted' to 'copied-here' or
> > 'moved-here' (depends on whether the arriving nodes intersect with the
> > old namespace). New nodes may be introduced, with presence=$whatever
> > and prior_deleted=0 (FALSE)
> > 
> > If a deletion of a child (subtree) of copied-here or moved-here
> > occurs, then it has a new op_depth and defines a whole new layer. The
> > "prior_deleted" is set to 1 (TRUE) indicating the prior layer (which
> > happens to be the copy/move-here) has been deleted.
> > 
> > Deletion of an add is effectively a revert. If this is a child, then
> > the layer is simply removed (it only has one node). If the
> > deletion/revert of an add has prior_deleted=1 (meaning it is a root),
> > then the node is rewritten to presence='deleted', restoring it to the
> > state when the deletion first occurred. (and yes, a second revert
> > undoes the deletion, etc...).
> > 
> > Reverting a child of a moved/copied-here tree is invalid. When you
> > revert the root, then the children at this op_depth are traversed: any
> > nodes with prior_deleted=1 are restored to presence=deleted, and nodes
> > with prior_deleted=0 (newly-arrived from the copy/move) are simply
> > removed.
> > 
> > Note that prior_deleted is set to TRUE only for a deletion operation
> > (when presence is set to 'deleted'). That implies a prior node
> > existed. For the sequence [rm A/B, add A/B, add A/B/foo], the node
> > A/B/foo will have op_depth=3 and prior_deleted=0 since the row was
> > created by an add. Assuming that A/B/foo existed originally, then
> > prior_deleted=1 at <A/B/foo, op_depth=2>.
> > 
> > 
> > I think that is it. Summarized a bit better from the earlier thread.
> > 
> > Cheers,
> > -g
>

Re: add NODES.prior_deleted boolean column

Posted by Philip Martin <ph...@wandisco.com>.

Greg Stein <gs...@gmail.com> writes:

> To do this, it seems that we're going to need to expose (from wc_db.h)
> a structure containing "all" the row data. Thankfully, this big ol'
> structure is *private* to libsvn_wc, and we can change it at will
> (unlike svn_wc_status_t). I would suggest that we use a callback --
> wc_db can step through each row of the result, fill in the structure,
> and execute a callback (clearing an iterpool on each "step" with the
> cursor).

Yes, the caching is private to libsvn_wc.  It might even be private to
status within libsvn_wc initially.

> The one tricky part to a callback is eliding all but the highest
> op_depth. Does an ORDER BY in the query affect its runtime at all?

I had the following indices on NODES:

 PRIMARY KEY(wc_id, local_relpath, op_depth)
 CREATE INDEX i_parent ON NODES (wc_id, parent_relpath, local_relpath)

if I change the i_parent index to

 CREATE INDEX i_parent ON NODES (wc_id, parent_relpath, local_relpath, op_depth)

then the order by on this query

    SELECT ... FROM NODES
     WHERE wc_id = ?1 AND parent_relpath = ?2
     ORDER BY local_relpath, op_depth

doesn't add a significant overhead.

-- 
Philip

Re: add NODES.prior_deleted boolean column

Posted by Greg Stein <gs...@gmail.com>.

On Tue, Sep 28, 2010 at 08:46, Philip Martin <ph...@wandisco.com> wrote:
> Julian Foad <ju...@wandisco.com> writes:
>
>>> I believe the answer is "often". A simple 'svn status' will need to
>>> distinguish between 'A' and 'R', and that is done by checking
>>> prior_deleted.
>>>
>>> And no... this amount of denormalization will not hurt us.
>>
>> OK.  I know that "svn status" speed is a big deal.
>
> I'm not sure prior_deleted is an optimisation.  When I last looked at

Great analysis below. I'll take your word for it :-)

prior_deleted is effectively something like base_shadowed, and yeah...
if we're iterating over all nodes within a parent, then it can be
generated "simply".

> the performance of SQLite I concluded that status would be too slow if
> it ran per-node queries, it has to run per-dir queries.  So
>
>  SELECT ... FROM BASE_NODE WHERE wc_id = ?1 AND local_relpath = ?2
>
> is too slow, we need to run
>
>  SELECT ... FROM BASE_NODE WHERE wc_id = ?1 AND parent_relpath = ?2
>
> iterate over the rows caching the data and then generate the status
> for each child.

To do this, it seems that we're going to need to expose (from wc_db.h)
a structure containing "all" the row data. Thankfully, this big ol'
structure is *private* to libsvn_wc, and we can change it at will
(unlike svn_wc_status_t). I would suggest that we use a callback --
wc_db can step through each row of the result, fill in the structure,
and execute a callback (clearing an iterpool on each "step" with the
cursor).

The one tricky part to a callback is eliding all but the highest
op_depth. Does an ORDER BY in the query affect its runtime at all?

>...
> but that nested SELECT is expensive.  It's not as bad as doing
> per-node queries but it is still too slow, the database query alone is
> slower than 1.6 status.  I don't think this is something that can be
> fixed using an index because on my test data the runtime generally
> goes up by orders of magnitude when the indexing is wrong.

Yeah... the more indexes present, the more that need to be maintained
when rows are modified.

> I can get round this by querying for all op_depth and using the cache
> to select the highest.  The cache is a hash, keyed on local_relpath,
> that stores the data associated with the highest op_depth seen and it
> simply discards/overwrites lower op_depth.  I've prototyped this and

If we order by local_relpath, then the "cache" is simply one row
(hanging out in a structure, waiting to be passed to that callback).

> it's as fast as the per-dir BASE_NODE query on my test data.  This is
> not surprising since my test data consists of mostly single op_depth
> with a few double op_depth.  We have to have good performance on such
> working copies because they are the most common, I think it will be
> unusual to have a large working copy where lots of nodes have a large
> number of op_depth.

Agreed.

> Now to prior_deleted.  The fundamental status query looks like
>
>  SELECT ... FROM NODES WHERE wc_id = ?1 AND parent_relpath = ?2
>
> and all op_depth are seen.  It's quite simple to cache the two highest
> op_depth so prior_deleted only provides information that is already
> available. It's not an optimisation for status, if anything it will
> make the database bigger and slower.

Again, thanks for the analysis. Okay... let's skip it!

>...

Cheers,
-g

Re: add NODES.prior_deleted boolean column

Posted by Philip Martin <ph...@wandisco.com>.

Julian Foad <ju...@wandisco.com> writes:

>> I believe the answer is "often". A simple 'svn status' will need to
>> distinguish between 'A' and 'R', and that is done by checking
>> prior_deleted.
>> 
>> And no... this amount of denormalization will not hurt us.
>
> OK.  I know that "svn status" speed is a big deal.

I'm not sure prior_deleted is an optimisation.  When I last looked at
the performance of SQLite I concluded that status would be too slow if
it ran per-node queries, it has to run per-dir queries.  So

  SELECT ... FROM BASE_NODE WHERE wc_id = ?1 AND local_relpath = ?2

is too slow, we need to run

  SELECT ... FROM BASE_NODE WHERE wc_id = ?1 AND parent_relpath = ?2

iterate over the rows caching the data and then generate the status
for each child.

NODES and op_depth adds a complication.  Getting the equivalent of
BASE_NODE is easy, we add "AND op_depth = 0", but status wants the row
with the highest op_depth for each node, and that's a different number
for each row in a per-dir query.  We can do it like this:

  SELECT ... FROM NODES n WHERE wc_id = ?1 AND parent_relpath = ?2
  AND op_depth = (SELECT MAX(op_depth) FROM NODES m
                   WHERE m.wc_id = n.wc_id
                     AND m.local_relpath = n.local_relpath)

but that nested SELECT is expensive.  It's not as bad as doing
per-node queries but it is still too slow, the database query alone is
slower than 1.6 status.  I don't think this is something that can be
fixed using an index because on my test data the runtime generally
goes up by orders of magnitude when the indexing is wrong.

I can get round this by querying for all op_depth and using the cache
to select the highest.  The cache is a hash, keyed on local_relpath,
that stores the data associated with the highest op_depth seen and it
simply discards/overwrites lower op_depth.  I've prototyped this and
it's as fast as the per-dir BASE_NODE query on my test data.  This is
not surprising since my test data consists of mostly single op_depth
with a few double op_depth.  We have to have good performance on such
working copies because they are the most common, I think it will be
unusual to have a large working copy where lots of nodes have a large
number of op_depth.

Now to prior_deleted.  The fundamental status query looks like

  SELECT ... FROM NODES WHERE wc_id = ?1 AND parent_relpath = ?2

and all op_depth are seen.  It's quite simple to cache the two highest
op_depth so prior_deleted only provides information that is already
available. It's not an optimisation for status, if anything it will
make the database bigger and slower.

Below are the Python script that generates a big NODES database, and a
C program that prototypes the status queries:

#!/usr/bin/python

import os, sqlite3

try: os.remove('wcx.db')
except: pass

c = sqlite3.connect('wcx.db')
c.execute("""create table repository (
               id integer primary key autoincrement,
               root text unique not null,
               uuid text not null)""")
c.execute("""create index i_uuid on repository (uuid)""")
c.execute("""create index i_root on repository (root)""")
c.execute("""create table wcroot (
               id integer primary key autoincrement,
               local_abspath text unique)""")
c.execute("""create unique index i_local_abspath on wcroot (local_abspath)""")
c.execute("""create table nodes (
               wc_id integer not null references wcroot (id),
               local_relpath text not null,
               op_depth integer not null,
               parent_relpath text,
               repos_id integer references repository (id),
               repos_path text,
               revision integer,
               presence text not null,
               depth text,
               moved_here integer,
               moved_to text,
               kind text not null,
               changed_revision integer,
               changed_date integer,
               changed_author text,
               checksum text
               properties blob,
               translated_size integer,
               last_mod_time integer,
               dav_cache blob,
               symlink_target text,
               file_external text,
               primary key(wc_id, local_relpath, op_depth))""")
c.execute("""create index i_parent on nodes (wc_id,
                                             parent_relpath,
                                             local_relpath)""")
c.execute("""insert into repository (root, uuid) values (
               "http://example.com/repo",
               "f738be9e-409d-481f-b246-1fb6a969aba2")""")
c.execute("""insert into wcroot(local_abspath) values ("/wc")""")

c.execute("""insert into nodes (
               wc_id,
               local_relpath,
               op_depth,
               repos_id,
               repos_path,
               parent_relpath,
               presence,
               kind)
             values (
               1,
               "",
               0,
               1,
               "trunk",
               NULL,
               "normal",
               "dir")""")

for i in range(100):
    c.execute("""insert into nodes (
                   wc_id,
                   local_relpath,
                   op_depth,
                   repos_id,
                   repos_path,
                   parent_relpath,
                   presence,
                   kind)
                 values (
                   1,
                   "foo"""+str(i)+"""",
                   0,
                   1,
                   "trunk/foo"""+str(i)+"""",
                   "",
                   "normal",
                   "file")""")
    if i >= 30:
        continue;
    c.execute("""insert into nodes (
                   wc_id,
                   local_relpath,
                   op_depth,
                   repos_id,
                   repos_path,
                   parent_relpath,
                   presence,
                   kind)
                 values (
                   1,
                   "zag"""+str(i)+"""",
                   0,
                   1,
                   "trunk/zag"""+str(i)+"""",
                   "",
                   "normal",
                   "dir")""")
    c.execute("""insert into nodes (
                   wc_id,
                   local_relpath,
                   op_depth,
                   repos_id,
                   repos_path,
                   parent_relpath,
                   presence,
                   kind)
                 values (
                   1,
                   "zig"""+str(i)+"""",
                   0,
                   1,
                   "trunk/zig"""+str(i)+"""",
                   "",
                   "normal",
                   "dir")""")

    for j in range(100):
        c.execute("""insert into nodes (
                       wc_id,
                       local_relpath,
                       op_depth,
                       repos_id,
                       repos_path,
                       parent_relpath,
                       presence,
                       kind)
                     values (
                       1,
                       "zag"""+str(i)+"/foo"+str(j)+"""",
                       0,
                       1,
                       "trunk/zag"""+str(i)+"/foo"+str(j)+"""",
                       "zag"""+str(i)+"""",
                       "normal",
                       "file")""")
        if j % 10 == 1:
          c.execute("""insert into nodes (
                         wc_id,
                         local_relpath,
                         op_depth,
                         repos_id,
                         repos_path,
                         parent_relpath,
                         presence,
                         kind)
                       values (
                         1,
                         "zag"""+str(i)+"/foo"+str(j)+"""",
                         3,
                         1,
                         "trunk/zag"""+str(i)+"/foo"+str(j)+"""",
                         "zag"""+str(i)+"""",
                         "base-delete",
                         "file")""")
          c.execute("""insert into nodes (
                         wc_id,
                         local_relpath,
                         op_depth,
                         repos_id,
                         repos_path,
                         parent_relpath,
                         presence,
                         kind)
                       values (
                         1,
                         "zag"""+str(i)+"/bar"+str(j)+"""",
                         3,
                         null,
                         null,
                         "zag"""+str(i)+"""",
                         "normal",
                         "file")""")
        c.execute("""insert into nodes (
                       wc_id,
                       local_relpath,
                       op_depth,
                       repos_id,
                       repos_path,
                       parent_relpath,
                       presence,
                       kind)
                     values (
                       1,
                       "zig"""+str(i)+"/foo"+str(j)+"""",
                       0,
                       1,
                       "trunk/zig"""+str(i)+"/foo"+str(j)+"""",
                       "zig"""+str(i)+"""",
                       "normal",
                       "file")""")
        if j >= 30:
            continue
        c.execute("""insert into nodes (
                       wc_id,
                       local_relpath,
                       op_depth,
                       repos_id,
                       repos_path,
                       parent_relpath,
                       presence,
                       kind)
                     values (
                       1,
                       "zig"""+str(i)+"/zag"+str(j)+"""",
                       0,
                       1,
                       "trunk/zig"""+str(i)+"/zag"+str(j)+"""",
                       "zig"""+str(i)+"""",
                       "normal",
                       "dir")""")
        for k in range(100):
            c.execute("""insert into nodes (
                           wc_id,
                           local_relpath,
                           op_depth,
                           repos_id,
                           repos_path,
                           parent_relpath,
                           presence,
                           kind)
                         values (
                           1,
                           "zig"""+str(i)+"/zag"+str(j)+"/foo"+str(k)+"""",
                           0,
                           1,
                           "trunk/zig"""+str(i)+"/zag"+str(j)+"/foo"+str(k)+"""",
                           "zig"""+str(i)+"/zag"+str(j)+"""",
                           "normal",
                           "file")""")

c.commit()


#include "svn_pools.h"
#include "svn_sqlite.h"
#include <stdio.h>

static svn_error_t *
status_query_per_node(svn_sqlite__db_t *sdb,
                      const char *local_relpath,
                      svn_boolean_t display,
                      apr_pool_t *pool)
{
  svn_sqlite__stmt_t *stmt;
  svn_boolean_t have_row;
  const char *kind;
  apr_pool_t *subpool;
  apr_array_header_t *children;
  int num, i;

  /* Display this node. */
  SVN_ERR(svn_sqlite__get_statement(&stmt, sdb, 0));
  SVN_ERR(svn_sqlite__bindf(stmt, "is", 1, local_relpath));
  SVN_ERR(svn_sqlite__step(&have_row, stmt));
  if (!have_row)
    {
      SVN_ERR(svn_sqlite__reset(stmt));
      return SVN_NO_ERROR;
    }
  kind = svn_sqlite__column_text(stmt, 0, pool);
  SVN_ERR(svn_sqlite__reset(stmt));
  if (display)
    printf("%s %s\n", local_relpath, kind);
  
  if (strcmp(kind, "dir"))
    return SVN_NO_ERROR;

  /* Gather children. */
#if 0
  SVN_ERR(svn_sqlite__get_statement(&stmt, sdb, 2));
  SVN_ERR(svn_sqlite__bindf(stmt, "is", 1, local_relpath));
  SVN_ERR(svn_sqlite__step(&have_row, stmt));
  if (!have_row)
    {
      SVN_ERR(svn_sqlite__reset(stmt));
      return SVN_NO_ERROR;
    }
  num = svn_sqlite__column_int64(stmt, 0);
  SVN_ERR(svn_sqlite__reset(stmt));
#else
  num = 10;
#endif
  children = apr_array_make(pool, num, sizeof(const char *));
  SVN_ERR(svn_sqlite__get_statement(&stmt, sdb, 5));
  SVN_ERR(svn_sqlite__bindf(stmt, "is", 1, local_relpath));
  SVN_ERR(svn_sqlite__step(&have_row, stmt));
  while (have_row)
    {
      APR_ARRAY_PUSH(children, const char *)
        = svn_sqlite__column_text(stmt, 0, pool);
      SVN_ERR(svn_sqlite__step(&have_row, stmt));
    }
  SVN_ERR(svn_sqlite__reset(stmt));

  /* Display children. */
  subpool = svn_pool_create(pool);
  for (i = 0; i < children->nelts; ++i)
    {
      const char *child_relpath = APR_ARRAY_IDX(children, i, const char *);

      svn_pool_clear(subpool);
      SVN_ERR(status_query_per_node(sdb, child_relpath, display, subpool));
    }
  svn_pool_destroy(subpool);

  return SVN_NO_ERROR;
}

struct node_t {
  svn_boolean_t is_dir;
  int op_depth;
};

static svn_error_t *
status_query_per_dir(svn_sqlite__db_t *sdb,
                     const char *local_relpath,
                     svn_boolean_t display,
                     apr_pool_t *pool)
{
  svn_sqlite__stmt_t *stmt;
  svn_boolean_t have_row;
  const char *kind;
  apr_pool_t *subpool;
  apr_hash_t *children = apr_hash_make(pool);
  apr_hash_index_t *hi;

  /* Display this node. */
  SVN_ERR(svn_sqlite__get_statement(&stmt, sdb, 0));
  SVN_ERR(svn_sqlite__bindf(stmt, "is", 1, local_relpath));
  SVN_ERR(svn_sqlite__step(&have_row, stmt));
  if (!have_row)
    {
      SVN_ERR(svn_sqlite__reset(stmt));
      return SVN_NO_ERROR;
    }
  kind = svn_sqlite__column_text(stmt, 0, pool);
  SVN_ERR(svn_sqlite__reset(stmt));
  if (display)
    printf("%s %s\n", local_relpath, kind);
  
  if (strcmp(kind, "dir"))
    return SVN_NO_ERROR;

  /* Gather children. */
  SVN_ERR(svn_sqlite__get_statement(&stmt, sdb, 1));
  SVN_ERR(svn_sqlite__bindf(stmt, "is", 1, local_relpath));
  SVN_ERR(svn_sqlite__step(&have_row, stmt));
  while (have_row)
    {
      const char *child_relpath = svn_sqlite__column_text(stmt, 0, NULL);
      int op_depth = svn_sqlite__column_int64(stmt, 2);
      struct node_t *node;

      node = apr_hash_get(children, child_relpath, APR_HASH_KEY_STRING);
      if (!node)
        {
          int klen = strlen(child_relpath);
          node = apr_palloc(pool, sizeof(*node));
          node->op_depth = -1;
          apr_hash_set(children, apr_pstrmemdup(pool, child_relpath, klen), 
                       klen, node);
        }
      if (op_depth > node->op_depth)
        {
          node->op_depth = op_depth;
          node->is_dir = !strcmp("dir", svn_sqlite__column_text(stmt, 1, NULL));
        }
      SVN_ERR(svn_sqlite__step(&have_row, stmt));
    }
  SVN_ERR(svn_sqlite__reset(stmt));

  /* Display children. */
  subpool = svn_pool_create(pool);
  for (hi = apr_hash_first(pool, children); hi; hi = apr_hash_next(hi))
    {
      struct node_t *node;
      svn_pool_clear(subpool);

      node = svn__apr_hash_index_val(hi);
      if (node->is_dir)
        SVN_ERR(status_query_per_dir(sdb, svn__apr_hash_index_key(hi), display,
                                     subpool));
      else if (display)
        printf("%s file\n", (const char*)svn__apr_hash_index_key(hi));
    }
  svn_pool_destroy(subpool);

  return SVN_NO_ERROR;
}

static void usage(const char *arg)
{
  fprintf(stderr,
          "usage: %s n|d q|v s|m db\n"
          "where: n does per-node queries\n"
          "       d does per-dir queries\n"
          "       q suppresses output\n"
          "       v displays ouptut\n"
          "       s use a single transaction\n"
          "       m uses multiple transactions\n", arg);
  exit(1);
}

static void parse_opts(int argc,
                       const char *argv[],
                       svn_boolean_t *per_dir,
                       svn_boolean_t *display,
                       svn_boolean_t *savepoint)
{
  if (argc != 5)
    usage(argv[0]);

  if (!strcmp(argv[1], "n"))
    *per_dir = FALSE;
  else if (!strcmp(argv[1], "d"))
    *per_dir = TRUE;
  else
    usage(argv[0]);

  if (!strcmp(argv[2], "q"))
    *display = FALSE;
  else if (!strcmp(argv[2], "v"))
    *display = TRUE;
  else
    usage(argv[0]);

  if (!strcmp(argv[3], "s"))
    *savepoint = TRUE;
  else if (!strcmp(argv[3], "m"))
    *savepoint = FALSE;
  else
    usage(argv[0]);
}

int main(int argc, const char *argv[])
{
  apr_pool_t *pool;  
  svn_sqlite__db_t *sdb;
  svn_boolean_t per_dir, display, savepoint;
  const char * const statements[] = {
    /* 0 */
    "select kind from nodes"
    " where wc_id = ?1 and local_relpath = ?2 order by op_depth desc limit 1",
    /* 1 */
    "select local_relpath, kind, op_depth from nodes"
    " where wc_id = ?1 and parent_relpath = ?2",
    /* 2 */
    "select count(*) from (select distinct local_relpath from nodes"
    "                       where wc_id = ?1 and parent_relpath = ?2)",
    /* 3 */
    "savepoint q",
    /* 4 */
    "release savepoint q",
    /* 5 */
    "select distinct local_relpath from nodes"
    " where wc_id = ?1 and parent_relpath = ?2",
    /* 6 */
    "select local_relpath from nodes"
    " where wc_id = ?1 and parent_relpath = ?2",
    /* 7 */
    "select local_relpath, kind, op_depth from nodes n"
    " where wc_id = ?1 and parent_relpath = ?2"
    "   and op_depth = (select max(op_depth) from nodes m"
    "                    where m.wc_id = n.wc_id"
    "                      and m.local_relpath = n.local_relpath)",
    NULL
  };

  parse_opts(argc, argv, &per_dir, &display, &savepoint);
  apr_initialize();
  pool = svn_pool_create(NULL);
  SVN_INT_ERR(svn_sqlite__open(&sdb, argv[argc - 1], svn_sqlite__mode_rwcreate,
                               statements, 0, NULL, pool, pool));
  if (savepoint)
    SVN_INT_ERR(svn_sqlite__exec_statements(sdb, 3));

  if (per_dir)
    SVN_INT_ERR(status_query_per_dir(sdb, "", display, pool));
  else
    SVN_INT_ERR(status_query_per_node(sdb, "", display, pool));

  if (savepoint)
    SVN_INT_ERR(svn_sqlite__exec_statements(sdb, 4));

  return EXIT_SUCCESS;
}


-- 
Philip

Re: add NODES.prior_deleted boolean column

Posted by Greg Stein <gs...@gmail.com>.

On Thu, Sep 23, 2010 at 13:32, Julian Foad <ju...@wandisco.com> wrote:
>...
>> > 'excluded':
>> >
>> > I think we need to support 'excluded' at op_depth > 0 on a copied-here
>> > node (only for a child, not the root of the copy), and also on a
>> > moved-here node (child).  We should distinguish these.  How?
>>
>> "should"? Why? Again, looking at the parent immediately tells us what
>> is going on. But I don't even see where/how/why we need to know this
>> information.
>
> Maybe we'll never need to know, in practice.  But I'm thinking from a
> "clean design" rather than a "how frequently are we likely to ask" point
> of view.  For sanity there should be one way of asking questions such as
> "is this part of a copy", not two ways.

I don't know that that question is ever asked of an excluded node, is it?

I agree with minimizing exceptions, in general, but I'm not sure that
having multiple 'exception' presence values is useful. Especially when
a simple query on the parent provides the discriminator.

>...
>> At the moment, we do not record the local_relpath of the source of a
>> copy/move. (it may be null for a repo->WC copy; I believe we would not
>> allow a repo->WC move)
>
> Hmm.  If the 'moved-*' presence values are indeed intended to support
> atomic moves (in the future), we may need to re-think how they are to do
> so.  Storing the local_relpath (and op_depth I guess!) of the source is
> probably going to be a necessary part of the puzzle.
>...
>> > One of the particular desired behaviours of a move is that 'update' will
>> > still update it.
>>
>> So you say. I don't know that that is true.
>>
>> After a WC-WC copy, we don't update the copied nodes (afaik), so I
>> don't see that this immediately follows.
>
> We should discuss support for 'moves' separately.

Yeah. I didn't want to get too deep-ended on move semantics.
Representation within the database? Sure. But I'm also fine with
getting a first-order approximation and deferring until 1.8 (or
whenever we push moves down into the database).

>...
>> > Except for reverting, where you can get the reverse of your last point.
>> > For example, follow your last step "3." with "4. revert B. delete row
>> > <A/B, 2>. modify rows <A/B/C, 2>, <A/B/gamma, 2>: op_depth = 3.
>>
>> No. Step 3 subsumed the delete from step 2. All knowledge of it is
>> lost. A revert will delete the three rows at op_depth=2 and expose the
>> original checked-out rows.
>
> I meant a non-recursive revert, which just reverts 'A/B', and leaves its
> children deleted.  That's when it has to re-write the children as if
> they had been deleted individually.

Oh! I hadn't thought of a non-recursive revert. You're quite right in that case!

Thanks,
-g

Re: add NODES.prior_deleted boolean column

Posted by Julian Foad <ju...@wandisco.com>.

Hi Greg.

Thanks for the discussion.  I'm at least getting to grips with the
basics of op_depth functioning.

Greg Stein wrote:
> On Thu, Sep 23, 2010 at 09:19, Julian Foad <ju...@wandisco.com> wrote:
> > On Wed, 2010-09-22, Greg Stein wrote:
> >...
> >> > That flag would just mean "There is a row for the same path with a
> >> > smaller op_depth and with a non-negative kind of presence", right?  So
> >> > whether we actually store that flag is a matter of impl/efficiency, not
> >> > of logical design.  Have I understood?
> >>
> >> No. It means that this (relpath/op_depth) layer was created by a
> >> deletion ("svn rm" or "svn mv").
> >
> > OK, that's fine.  What I'm trying to ascertain next is whether this
> > information is already stored in the table without adding this extra
> > column.  Let me try again, differently:
> 
> Sure, you could derive the column by examining layers underneath. The
> semantics of add/copy/move is "not allowed if something exists", so
> anything that was there has been deleted first.

Thanks for confirming my understanding of that.

> The underlying question is "what is the query pattern? do we ask 'was
> something deleted by this operation?' very often?". If the answer is
> yes, then keep the flag. If we don't ask that question often, then run
> the extra query looking for prior nodes.
> 
> I believe the answer is "often". A simple 'svn status' will need to
> distinguish between 'A' and 'R', and that is done by checking
> prior_deleted.
> 
> And no... this amount of denormalization will not hurt us.

OK.  I know that "svn status" speed is a big deal.

> >...
> >> op_depth > 0 can have the following presence values: added, deleted,
> >> copied-here, moved-here, excluded, incomplete. prior_deleted may be 0
> >> or 1. moved_to is a discriminator between deleted and moved-away.
> >
> > Great, this is good.
> >
> > Let's go into a bit more detail on some of the op_depth > 0 values.
> >
> > 'incomplete':
> >
> > Let's see where 'incomplete' can be used at op_depth > 0.  Primarily
> > during a repo->WC or WC->WC copy, constructing the 'copied-here' dirs.
> > Maybe also during a (WC->WC) move, constructing the 'moved-here' dirs,
> > so that the whole subtree need not be moved in a single database
> > operation.  If so, I'm wondering whether we need to distinguish
> > copied-here-incomplete from moved-here-incomplete.  Or maybe we can
> > always rely on the caller remembering what it's doing and setting the
> > correct state afterwards.  I'm not sure.  If we start out assuming it
> > refers to a copy-here, and later find that we do need to distinguish
> > these, then I think we could do so at that time by introducing another
> > value.
> 
> I don't think we need to discriminate *how* or *why* the incomplete
> nodes were created... just that they are. And if find we need to ask
> that question, then we can examine the parent node, or we can add new
> presence values.

OK.

> > 'excluded':
> >
> > I think we need to support 'excluded' at op_depth > 0 on a copied-here
> > node (only for a child, not the root of the copy), and also on a
> > moved-here node (child).  We should distinguish these.  How?
> 
> "should"? Why? Again, looking at the parent immediately tells us what
> is going on. But I don't even see where/how/why we need to know this
> information.

Maybe we'll never need to know, in practice.  But I'm thinking from a
"clean design" rather than a "how frequently are we likely to ask" point
of view.  For sanity there should be one way of asking questions such as
"is this part of a copy", not two ways.

> >...
> > 'moved-here':
> >
> > I believe the idea of the 'move' in WC-NG is that it represents an
> > atomic move.  We should consider how a move is to be represented, in
> > full.
> >
> > A 'moved-here' row's 'moved_to' column points to the corresponding
> > 'moved-away' row.  The 'source' of the move is the row that is shadowed
> > by the 'moved-away' row.
> 
> Huh?

Oops - I got the 'moved_to' direction mixed up.  Hence the bit I wrote
below about looking up via the source.

> A 'moved-here' node has repository source information given in the
> repos_* columns.
> 
> A 'moved-away' node has presence='deleted' and moved_to set to the
> destination local_relpath.

Yes and yes.

> At the moment, we do not record the local_relpath of the source of a
> copy/move. (it may be null for a repo->WC copy; I believe we would not
> allow a repo->WC move)

Hmm.  If the 'moved-*' presence values are indeed intended to support
atomic moves (in the future), we may need to re-think how they are to do
so.  Storing the local_relpath (and op_depth I guess!) of the source is
probably going to be a necessary part of the puzzle.

> > The destination of the move must always refer to the same repo node-rev
> > as the source did.  Therefore its repo node-rev columns and stored
> > (pristine) content columns (file checksum, link target, dir depth) could
> > be either stored as duplicates of the source row values, or stored as
> > null and always looked up via the source row when needed.
> 
> Duplicates. No need to make querying harder than it should be.
> 
> > One of the particular desired behaviours of a move is that 'update' will
> > still update it.
> 
> So you say. I don't know that that is true.
> 
> After a WC-WC copy, we don't update the copied nodes (afaik), so I
> don't see that this immediately follows.

We should discuss support for 'moves' separately.

> >...
> >> On Wed, Sep 22, 2010 at 18:27, Greg Stein <gs...@gmail.com> wrote:
> >> >...
> >> > The children don't have to be touched. They just hang out in their
> >> > deleted state with the same op_depth. We *never* want to modify a rows
> >> > op_depth. That is part of its primary key(!).
> >>
> >> Let me clarify this. There is one situation where we (effectively)
> >> modify the op_depth.
> >>
> >> Consider:
> >>
> >> 1. checkout. add rows <A, 0>, <A/B, 0>, <A/B/C, 0>, <A/B/gamma, 0>:
> >> presence=normal.
> >> 2. delete gamma. add row <A/B/gamma, 3>: presence=deleted.
> >> 3. delete B. add rows <A/B, 2>, <A/B/C, 2>: presence=deleted. modify
> >> <A/B/gamma, 3>: op_depth=2
> >>
> >> We subsume the child deletion into the parent deletion. We could just
> >> as well revert/forget the child deletion and add rows for all children
> >> at op_depth, or we can just rewrite the op_depth. Net is the same.
> >>
> >>
> >> Outside of this deletion, I don't think we want to be modifying
> >> op_depth at all.
> >
> > Except for reverting, where you can get the reverse of your last point.
> > For example, follow your last step "3." with "4. revert B. delete row
> > <A/B, 2>. modify rows <A/B/C, 2>, <A/B/gamma, 2>: op_depth = 3.
> 
> No. Step 3 subsumed the delete from step 2. All knowledge of it is
> lost. A revert will delete the three rows at op_depth=2 and expose the
> original checked-out rows.

I meant a non-recursive revert, which just reverts 'A/B', and leaves its
children deleted.  That's when it has to re-write the children as if
they had been deleted individually.

- Julian

Re: add NODES.prior_deleted boolean column

Posted by Greg Stein <gs...@gmail.com>.

On Thu, Sep 23, 2010 at 09:19, Julian Foad <ju...@wandisco.com> wrote:
> On Wed, 2010-09-22, Greg Stein wrote:
>...
>> > That flag would just mean "There is a row for the same path with a
>> > smaller op_depth and with a non-negative kind of presence", right?  So
>> > whether we actually store that flag is a matter of impl/efficiency, not
>> > of logical design.  Have I understood?
>>
>> No. It means that this (relpath/op_depth) layer was created by a
>> deletion ("svn rm" or "svn mv").
>
> OK, that's fine.  What I'm trying to ascertain next is whether this
> information is already stored in the table without adding this extra
> column.  Let me try again, differently:

Sure, you could derive the column by examining layers underneath. The
semantics of add/copy/move is "not allowed if something exists", so
anything that was there has been deleted first.

The underlying question is "what is the query pattern? do we ask 'was
something deleted by this operation?' very often?". If the answer is
yes, then keep the flag. If we don't ask that question often, then run
the extra query looking for prior nodes.

I believe the answer is "often". A simple 'svn status' will need to
distinguish between 'A' and 'R', and that is done by checking
prior_deleted.

And no... this amount of denormalization will not hurt us.

>...
>> op_depth > 0 can have the following presence values: added, deleted,
>> copied-here, moved-here, excluded, incomplete. prior_deleted may be 0
>> or 1. moved_to is a discriminator between deleted and moved-away.
>
> Great, this is good.
>
> Let's go into a bit more detail on some of the op_depth > 0 values.
>
> 'incomplete':
>
> Let's see where 'incomplete' can be used at op_depth > 0.  Primarily
> during a repo->WC or WC->WC copy, constructing the 'copied-here' dirs.
> Maybe also during a (WC->WC) move, constructing the 'moved-here' dirs,
> so that the whole subtree need not be moved in a single database
> operation.  If so, I'm wondering whether we need to distinguish
> copied-here-incomplete from moved-here-incomplete.  Or maybe we can
> always rely on the caller remembering what it's doing and setting the
> correct state afterwards.  I'm not sure.  If we start out assuming it
> refers to a copy-here, and later find that we do need to distinguish
> these, then I think we could do so at that time by introducing another
> value.

I don't think we need to discriminate *how* or *why* the incomplete
nodes were created... just that they are. And if find we need to ask
that question, then we can examine the parent node, or we can add new
presence values.

> 'excluded':
>
> I think we need to support 'excluded' at op_depth > 0 on a copied-here
> node (only for a child, not the root of the copy), and also on a
> moved-here node (child).  We should distinguish these.  How?

"should"? Why? Again, looking at the parent immediately tells us what
is going on. But I don't even see where/how/why we need to know this
information.

>...
> 'moved-here':
>
> I believe the idea of the 'move' in WC-NG is that it represents an
> atomic move.  We should consider how a move is to be represented, in
> full.
>
> A 'moved-here' row's 'moved_to' column points to the corresponding
> 'moved-away' row.  The 'source' of the move is the row that is shadowed
> by the 'moved-away' row.

Huh?

A 'moved-here' node has repository source information given in the
repos_* columns.

A 'moved-away' node has presence='deleted' and moved_to set to the
destination local_relpath.

At the moment, we do not record the local_relpath of the source of a
copy/move. (it may be null for a repo->WC copy; I believe we would not
allow a repo->WC move)

> The destination of the move must always refer to the same repo node-rev
> as the source did.  Therefore its repo node-rev columns and stored
> (pristine) content columns (file checksum, link target, dir depth) could
> be either stored as duplicates of the source row values, or stored as
> null and always looked up via the source row when needed.

Duplicates. No need to make querying harder than it should be.

> One of the particular desired behaviours of a move is that 'update' will
> still update it.

So you say. I don't know that that is true.

After a WC-WC copy, we don't update the copied nodes (afaik), so I
don't see that this immediately follows.

>...
>> On Wed, Sep 22, 2010 at 18:27, Greg Stein <gs...@gmail.com> wrote:
>> >...
>> > The children don't have to be touched. They just hang out in their
>> > deleted state with the same op_depth. We *never* want to modify a rows
>> > op_depth. That is part of its primary key(!).
>>
>> Let me clarify this. There is one situation where we (effectively)
>> modify the op_depth.
>>
>> Consider:
>>
>> 1. checkout. add rows <A, 0>, <A/B, 0>, <A/B/C, 0>, <A/B/gamma, 0>:
>> presence=normal.
>> 2. delete gamma. add row <A/B/gamma, 3>: presence=deleted.
>> 3. delete B. add rows <A/B, 2>, <A/B/C, 2>: presence=deleted. modify
>> <A/B/gamma, 3>: op_depth=2
>>
>> We subsume the child deletion into the parent deletion. We could just
>> as well revert/forget the child deletion and add rows for all children
>> at op_depth, or we can just rewrite the op_depth. Net is the same.
>>
>>
>> Outside of this deletion, I don't think we want to be modifying
>> op_depth at all.
>
> Except for reverting, where you can get the reverse of your last point.
> For example, follow your last step "3." with "4. revert B. delete row
> <A/B, 2>. modify rows <A/B/C, 2>, <A/B/gamma, 2>: op_depth = 3.

No. Step 3 subsumed the delete from step 2. All knowledge of it is
lost. A revert will delete the three rows at op_depth=2 and expose the
original checked-out rows.

Cheers,
-g

Re: add NODES.prior_deleted boolean column

Posted by Julian Foad <ju...@wandisco.com>.

On Wed, 2010-09-22, Greg Stein wrote:
> On Tue, Sep 21, 2010 at 13:41, Julian Foad wrote:
> > Greg Stein wrote:
> >> After working through the several email messages, and discussion, I
> >> believe we're now down to a simple change:
> >>
> >> * add a "prior_deleted" flag to NODES
> >>
> >> The flag simply means that a node exists prior to this layer and has
> >> been deleted or moved-away. The 'presence' column may say the same
> >> thing, but it might also describe data that is replacing the
> >> deletion/move.
> >
> > Do you see this working in conjunction with the current set of presence
> > values, or your proposed extended set?
> 
> The extended set. The "current set" is dead to me.

OK.

> > That flag would just mean "There is a row for the same path with a
> > smaller op_depth and with a non-negative kind of presence", right?  So
> > whether we actually store that flag is a matter of impl/efficiency, not
> > of logical design.  Have I understood?
> 
> No. It means that this (relpath/op_depth) layer was created by a
> deletion ("svn rm" or "svn mv").

OK, that's fine.  What I'm trying to ascertain next is whether this
information is already stored in the table without adding this extra
column.  Let me try again, differently:

  If this (relpath/op_depth) row shadows another row, and the shadowed
row's presence is not 'deleted', then it's either deleting or replacing
something.  So prior_deleted = TRUE.

  If it shadows a row that is 'deleted', then this row must be creating
a replacement: it doesn't need to delete first, and it can't be just a
delete.  So prior_delete = FALSE.

  If it doesn't shadow another row, then it's creating a new node
(add/cp/mv-here).  So prior_delete = FALSE.

So prior_deleted == "this row shadows another row, and that shadowed row
is not 'deleted'".

That works for all the cases I can hold in my head.  Maybe I've missed
some cases involving switched nodes, or mixed revs, or something, that
break that rule.

> Additional operations (add, copy-here, move-here) might alter this
> layer, but it started with the deletion of a prior node/subtree.
> 
> > The subject that this arose from was how to store all the possible
> > states of a working row.  First I want to know what are all the possible
> > states of a working row that we need to represent, before we decide how
> > to represent them.  I don't think we have ever written them down yet, in
> > full detail, so I have tried.
> >
> > Please see the two tables in the "nodes-states" document that I am
> > attaching as .ods and as .pdf, and as two .png images.  I'm not sure
> > whether any of the attachments will get through to the list.
> 
> These tables are so much more complicated that it is very hard to
> read/understand them. It is billowing the combinatorics of what seems
> pretty darned simple.

I'd be more than happy to see the rules written in some other form.  I'm
just trying to get the design spelled out precisely, it doesn't matter
to me whether that includes this kind of table or not.  But at this
stage, thinking about each of the combinations is helpful to me.

> op_depth==0 can have the following presence values: normal,
> not-present, excluded, incomplete, unauthz. prior_deleted is always 0
> (FALSE).
> 
> op_depth > 0 can have the following presence values: added, deleted,
> copied-here, moved-here, excluded, incomplete. prior_deleted may be 0
> or 1. moved_to is a discriminator between deleted and moved-away.

Great, this is good.

Let's go into a bit more detail on some of the op_depth > 0 values.

'incomplete':

Let's see where 'incomplete' can be used at op_depth > 0.  Primarily
during a repo->WC or WC->WC copy, constructing the 'copied-here' dirs.
Maybe also during a (WC->WC) move, constructing the 'moved-here' dirs,
so that the whole subtree need not be moved in a single database
operation.  If so, I'm wondering whether we need to distinguish
copied-here-incomplete from moved-here-incomplete.  Or maybe we can
always rely on the caller remembering what it's doing and setting the
correct state afterwards.  I'm not sure.  If we start out assuming it
refers to a copy-here, and later find that we do need to distinguish
these, then I think we could do so at that time by introducing another
value.

'excluded':

I think we need to support 'excluded' at op_depth > 0 on a copied-here
node (only for a child, not the root of the copy), and also on a
moved-here node (child).  We should distinguish these.  How?

  * Call them 'copied-here-excluded' and 'moved-here-excluded'.

  * Make 'excluded' a separate flag.

  * Go back to distinguishing 'move' from 'copy' by means of a flag
instead of difference 'presence' values.

'moved-here':

I believe the idea of the 'move' in WC-NG is that it represents an
atomic move.  We should consider how a move is to be represented, in
full.

A 'moved-here' row's 'moved_to' column points to the corresponding
'moved-away' row.  The 'source' of the move is the row that is shadowed
by the 'moved-away' row.

The destination of the move must always refer to the same repo node-rev
as the source did.  Therefore its repo node-rev columns and stored
(pristine) content columns (file checksum, link target, dir depth) could
be either stored as duplicates of the source row values, or stored as
null and always looked up via the source row when needed.

One of the particular desired behaviours of a move is that 'update' will
still update it.  If we keep the idea that 'update' just updates the
BASE layer, we should design it so that the 'moved-here' row has null
for its node-rev columns, and the APIs that look at a working layer row
should look up the values via the move-source row when needed.

Alternatively, if we duplicate those columns then 'update' will have to
follow moves and adjust the 'moved-here' columns to keep them in sync.

> >...
> > I assume this is in conjunction with the current set of presence values,
> > not your proposed extended set.  So the possible changes would be
> > encoded as:
> >
> 
> Did you omit something here?

Oops, I meant to delete that paragraph.

> (I got the email as you sent it directly to me; no missing
> attachments; it seems like above is missing something tho)
> 
> >> When a deletion (of a subtree) occurs, then we create a new layer at
> >> <relpath, op_depth>. New rows are written for the root, and all
> >> children, using that op_depth value. If this is a moved-away, then we
> >> store the destination into moved_to at the root *only* (which can then
> >> later discriminate between the two types of deletions; children need
> >> to look to the root to discriminate; I bet this need is rare). Note
> >> that the deletion process needs to look for mods to descendents:
> >> deletes are integrated into this one; other operations may error with
> >> "can't delete local mods" or somesuch.
> >>
> >> For the following actions, these are applied to the root of a deletion:
> >
> > What do you mean "these are applied to the root of a deletion"?  I guess
> > "add", "copy-here", "move-here" can only be applied to the root of a
> > deletion or to an unversioned/not-present path; is that it?
> 
> Correct. You cannot add/copy-here/move-here to a descendent of the
> root of a deletion -- the root is still missing, so you have no parent
> for the new nodes.
> 
> >> If an add occurs, then the root is updated to set presence='added'. No
> >> other changes are needed.
> >
> > Apart from setting the new node kind.  And apart from changing the
> > op_depth of all its still-deleted children to obey the deletion-op-depth
> > rule:
> >
> >  checkout: (A/B, A/B/C, A/B/gamma),op_d=0,normal
> >
> >  delete A/B:  add rows (A/B, A/B/C, A/B/gamma),op_d=2,deleted
> 
> clarify as: presence=deleted. prior_deleted=TRUE.
> 
> >  add new file B:  modify row (A/B),op_d=2:
> >                       presence/status := deleted+added
> >                       kind := file
> 
> Yes.
> 
> >                   modify rows (A/B/C, A/B/gamma),op_d=2:
> >                       op_d := 3
> 
> No.
> 
> The children don't have to be touched. They just hang out in their
> deleted state with the same op_depth.

Oh yes, my mistake.

>  We *never* want to modify a rows
> op_depth. That is part of its primary key(!).
> 
> If you changed your operation above to: add subtree B, B/C, B/gamma,
> then you would have:
> 
> 1. modify <A/B, 2>: presence=added. (and kind=dir)
> 2. add <A/B/C, 3>: presence=added. kind=dir.
> 3. add <A/B/gamma, 3>: presence=added. kind=file.
> 
> The B/C and B/gamma nodes are at a higher op_depth, so they are now
> visible instead of the deleted nodes.

> On Wed, Sep 22, 2010 at 18:27, Greg Stein <gs...@gmail.com> wrote:
> >...
> > The children don't have to be touched. They just hang out in their
> > deleted state with the same op_depth. We *never* want to modify a rows
> > op_depth. That is part of its primary key(!).
> 
> Let me clarify this. There is one situation where we (effectively)
> modify the op_depth.
> 
> Consider:
> 
> 1. checkout. add rows <A, 0>, <A/B, 0>, <A/B/C, 0>, <A/B/gamma, 0>:
> presence=normal.
> 2. delete gamma. add row <A/B/gamma, 3>: presence=deleted.
> 3. delete B. add rows <A/B, 2>, <A/B/C, 2>: presence=deleted. modify
> <A/B/gamma, 3>: op_depth=2
> 
> We subsume the child deletion into the parent deletion. We could just
> as well revert/forget the child deletion and add rows for all children
> at op_depth, or we can just rewrite the op_depth. Net is the same.
> 
> 
> Outside of this deletion, I don't think we want to be modifying
> op_depth at all.

Except for reverting, where you can get the reverse of your last point.
For example, follow your last step "3." with "4. revert B. delete row
<A/B, 2>. modify rows <A/B/C, 2>, <A/B/gamma, 2>: op_depth = 3.

- Julian

Re: add NODES.prior_deleted boolean column

Posted by Greg Stein <gs...@gmail.com>.

On Wed, Sep 22, 2010 at 18:27, Greg Stein <gs...@gmail.com> wrote:
>...
> The children don't have to be touched. They just hang out in their
> deleted state with the same op_depth. We *never* want to modify a rows
> op_depth. That is part of its primary key(!).

Let me clarify this. There is one situation where we (effectively)
modify the op_depth.

Consider:

1. checkout. add rows <A, 0>, <A/B, 0>, <A/B/C, 0>, <A/B/gamma, 0>:
presence=normal.
2. delete gamma. add row <A/B/gamma, 3>: presence=deleted.
3. delete B. add rows <A/B, 2>, <A/B/C, 2>: presence=deleted. modify
<A/B/gamma, 3>: op_depth=2

We subsume the child deletion into the parent deletion. We could just
as well revert/forget the child deletion and add rows for all children
at op_depth, or we can just rewrite the op_depth. Net is the same.

Outside of this deletion, I don't think we want to be modifying op_depth at all.

Cheers,
-g

Re: add NODES.prior_deleted boolean column

Posted by Greg Stein <gs...@gmail.com>.

On Tue, Sep 21, 2010 at 13:41, Julian Foad <ju...@wandisco.com> wrote:
> Greg Stein wrote:
>> After working through the several email messages, and discussion, I
>> believe we're now down to a simple change:
>>
>> * add a "prior_deleted" flag to NODES
>>
>> The flag simply means that a node exists prior to this layer and has
>> been deleted or moved-away. The 'presence' column may say the same
>> thing, but it might also describe data that is replacing the
>> deletion/move.
>
> Do you see this working in conjunction with the current set of presence
> values, or your proposed extended set?

The extended set. The "current set" is dead to me.

> That flag would just mean "There is a row for the same path with a
> smaller op_depth and with a non-negative kind of presence", right?  So
> whether we actually store that flag is a matter of impl/efficiency, not
> of logical design.  Have I understood?

No. It means that this (relpath/op_depth) layer was created by a
deletion ("svn rm" or "svn mv").

Additional operations (add, copy-here, move-here) might alter this
layer, but it started with the deletion of a prior node/subtree.

> The subject that this arose from was how to store all the possible
> states of a working row.  First I want to know what are all the possible
> states of a working row that we need to represent, before we decide how
> to represent them.  I don't think we have ever written them down yet, in
> full detail, so I have tried.
>
> Please see the two tables in the "nodes-states" document that I am
> attaching as .ods and as .pdf, and as two .png images.  I'm not sure
> whether any of the attachments will get through to the list.

These tables are so much more complicated that it is very hard to
read/understand them. It is billowing the combinatorics of what seems
pretty darned simple.

op_depth==0 can have the following presence values: normal,
not-present, excluded, incomplete, unauthz. prior_deleted is always 0
(FALSE).

op_depth > 0 can have the following presence values: added, deleted,
copied-here, moved-here, excluded, incomplete. prior_deleted may be 0
or 1. moved_to is a discriminator between deleted and moved-away.

>...
> I assume this is in conjunction with the current set of presence values,
> not your proposed extended set.  So the possible changes would be
> encoded as:
>

Did you omit something here?

(I got the email as you sent it directly to me; no missing
attachments; it seems like above is missing something tho)

>> When a deletion (of a subtree) occurs, then we create a new layer at
>> <relpath, op_depth>. New rows are written for the root, and all
>> children, using that op_depth value. If this is a moved-away, then we
>> store the destination into moved_to at the root *only* (which can then
>> later discriminate between the two types of deletions; children need
>> to look to the root to discriminate; I bet this need is rare). Note
>> that the deletion process needs to look for mods to descendents:
>> deletes are integrated into this one; other operations may error with
>> "can't delete local mods" or somesuch.
>>
>> For the following actions, these are applied to the root of a deletion:
>
> What do you mean "these are applied to the root of a deletion"?  I guess
> "add", "copy-here", "move-here" can only be applied to the root of a
> deletion or to an unversioned/not-present path; is that it?

Correct. You cannot add/copy-here/move-here to a descendent of the
root of a deletion -- the root is still missing, so you have no parent
for the new nodes.

>> If an add occurs, then the root is updated to set presence='added'. No
>> other changes are needed.
>
> Apart from setting the new node kind.  And apart from changing the
> op_depth of all its still-deleted children to obey the deletion-op-depth
> rule:
>
>  checkout: (A/B, A/B/C, A/B/gamma),op_d=0,normal
>
>  delete A/B:  add rows (A/B, A/B/C, A/B/gamma),op_d=2,deleted

clarify as: presence=deleted. prior_deleted=TRUE.

>  add new file B:  modify row (A/B),op_d=2:
>                       presence/status := deleted+added
>                       kind := file

Yes.

>                   modify rows (A/B/C, A/B/gamma),op_d=2:
>                       op_d := 3

No.

The children don't have to be touched. They just hang out in their
deleted state with the same op_depth. We *never* want to modify a rows
op_depth. That is part of its primary key(!).

If you changed your operation above to: add subtree B, B/C, B/gamma,
then you would have:

1. modify <A/B, 2>: presence=added. (and kind=dir)
2. add <A/B/C, 3>: presence=added. kind=dir.
3. add <A/B/gamma, 3>: presence=added. kind=file.

The B/C and B/gamma nodes are at a higher op_depth, so they are now
visible instead of the deleted nodes.

Cheers,
-g

Re: add NODES.prior_deleted boolean column

Posted by Julian Foad <ju...@wandisco.com>.

Greg Stein wrote:
> After working through the several email messages, and discussion, I
> believe we're now down to a simple change:
> 
> * add a "prior_deleted" flag to NODES
> 
> The flag simply means that a node exists prior to this layer and has
> been deleted or moved-away. The 'presence' column may say the same
> thing, but it might also describe data that is replacing the
> deletion/move.

Do you see this working in conjunction with the current set of presence
values, or your proposed extended set?

That flag would just mean "There is a row for the same path with a
smaller op_depth and with a non-negative kind of presence", right?  So
whether we actually store that flag is a matter of impl/efficiency, not
of logical design.  Have I understood?

The subject that this arose from was how to store all the possible
states of a working row.  First I want to know what are all the possible
states of a working row that we need to represent, before we decide how
to represent them.  I don't think we have ever written them down yet, in
full detail, so I have tried.

Please see the two tables in the "nodes-states" document that I am
attaching as .ods and as .pdf, and as two .png images.  I'm not sure
whether any of the attachments will get through to the list.

Table 1

The first table enumerates all the states of a row in NODES,ignoring any
"prior-deleted" or "moved-away" part of the state if the node has also
been replaced.  It shows whether each such state can exist in BASE
(op_depth = 0) and/or in WORKING (op_depth > 0) rows.

The remaining columns are works in progress.  The "Can be excluded?"
column starts to address the question "Can we copy or move a tree that
contains an excluded node?"

Table 2

The second table starts to define the state that results from applying
any possible structural change to a node.

I assume this is in conjunction with the current set of presence values,
not your proposed extended set.  So the possible changes would be
encoded as:

> When a deletion (of a subtree) occurs, then we create a new layer at
> <relpath, op_depth>. New rows are written for the root, and all
> children, using that op_depth value. If this is a moved-away, then we
> store the destination into moved_to at the root *only* (which can then
> later discriminate between the two types of deletions; children need
> to look to the root to discriminate; I bet this need is rare). Note
> that the deletion process needs to look for mods to descendents:
> deletes are integrated into this one; other operations may error with
> "can't delete local mods" or somesuch.
> 
> For the following actions, these are applied to the root of a deletion:

What do you mean "these are applied to the root of a deletion"?  I guess
"add", "copy-here", "move-here" can only be applied to the root of a
deletion or to an unversioned/not-present path; is that it?

> If an add occurs, then the root is updated to set presence='added'. No
> other changes are needed.

Apart from setting the new node kind.  And apart from changing the
op_depth of all its still-deleted children to obey the deletion-op-depth
rule:

  checkout: (A/B, A/B/C, A/B/gamma),op_d=0,normal

  delete A/B:  add rows (A/B, A/B/C, A/B/gamma),op_d=2,deleted

  add new file B:  modify row (A/B),op_d=2:
                       presence/status := deleted+added
                       kind := file
                   modify rows (A/B/C, A/B/gamma),op_d=2:
                       op_d := 3

- Julian

> If a copied-here or moved-here occurs, then the root is updated with
> the appropriate status and source information. Child nodes *may* have
> their presence switched from 'deleted' to 'copied-here' or
> 'moved-here' (depends on whether the arriving nodes intersect with the
> old namespace). New nodes may be introduced, with presence=$whatever
> and prior_deleted=0 (FALSE)
> 
> If a deletion of a child (subtree) of copied-here or moved-here
> occurs, then it has a new op_depth and defines a whole new layer. The
> "prior_deleted" is set to 1 (TRUE) indicating the prior layer (which
> happens to be the copy/move-here) has been deleted.
> 
> Deletion of an add is effectively a revert. If this is a child, then
> the layer is simply removed (it only has one node). If the
> deletion/revert of an add has prior_deleted=1 (meaning it is a root),
> then the node is rewritten to presence='deleted', restoring it to the
> state when the deletion first occurred. (and yes, a second revert
> undoes the deletion, etc...).
> 
> Reverting a child of a moved/copied-here tree is invalid. When you
> revert the root, then the children at this op_depth are traversed: any
> nodes with prior_deleted=1 are restored to presence=deleted, and nodes
> with prior_deleted=0 (newly-arrived from the copy/move) are simply
> removed.
> 
> Note that prior_deleted is set to TRUE only for a deletion operation
> (when presence is set to 'deleted'). That implies a prior node
> existed. For the sequence [rm A/B, add A/B, add A/B/foo], the node
> A/B/foo will have op_depth=3 and prior_deleted=0 since the row was
> created by an add. Assuming that A/B/foo existed originally, then
> prior_deleted=1 at <A/B/foo, op_depth=2>.
> 
> 
> I think that is it. Summarized a bit better from the earlier thread.
> 
> Cheers,
> -g

Re: add NODES.prior_deleted boolean column

Posted by Julian Foad <ju...@wandisco.com>.

Greg Stein wrote:
> After working through the several email messages, and discussion, I
> believe we're now down to a simple change:
> 
> * add a "prior_deleted" flag to NODES
> 
> The flag simply means that a node exists prior to this layer and has
> been deleted or moved-away. The 'presence' column may say the same
> thing, but it might also describe data that is replacing the
> deletion/move.

Do you see this working in conjunction with the current set of presence
values, or your proposed extended set?

That flag would just mean "There is a row for the same path with a
smaller op_depth and with a non-negative kind of presence", right?  So
whether we actually store that flag is a matter of impl/efficiency, not
of logical design.  Have I understood?

The subject that this arose from was how to store all the possible
states of a working row.  First I want to know what are all the possible
states of a working row that we need to represent, before we decide how
to represent them.  I don't think we have ever written them down yet, in
full detail, so I have tried.

Please see the two tables in the "nodes-states" document that I am
attaching as .ods and as .pdf, and as two .png images.  I'm not sure
whether any of the attachments will get through to the list.

Table 1

The first table enumerates all the states of a row in NODES,ignoring any
"prior-deleted" or "moved-away" part of the state if the node has also
been replaced.  It shows whether each such state can exist in BASE
(op_depth = 0) and/or in WORKING (op_depth > 0) rows.

The remaining columns are works in progress.  The "Can be excluded?"
column starts to address the question "Can we copy or move a tree that
contains an excluded node?"

Table 2

The second table starts to define the state that results from applying
any possible structural change to a node.

I assume this is in conjunction with the current set of presence values,
not your proposed extended set.  So the possible changes would be
encoded as:

> When a deletion (of a subtree) occurs, then we create a new layer at
> <relpath, op_depth>. New rows are written for the root, and all
> children, using that op_depth value. If this is a moved-away, then we
> store the destination into moved_to at the root *only* (which can then
> later discriminate between the two types of deletions; children need
> to look to the root to discriminate; I bet this need is rare). Note
> that the deletion process needs to look for mods to descendents:
> deletes are integrated into this one; other operations may error with
> "can't delete local mods" or somesuch.
> 
> For the following actions, these are applied to the root of a deletion:

What do you mean "these are applied to the root of a deletion"?  I guess
"add", "copy-here", "move-here" can only be applied to the root of a
deletion or to an unversioned/not-present path; is that it?

> If an add occurs, then the root is updated to set presence='added'. No
> other changes are needed.

Apart from setting the new node kind.  And apart from changing the
op_depth of all its still-deleted children to obey the deletion-op-depth
rule:

  checkout: (A/B, A/B/C, A/B/gamma),op_d=0,normal

  delete A/B:  add rows (A/B, A/B/C, A/B/gamma),op_d=2,deleted

  add new file B:  modify row (A/B),op_d=2:
                       presence/status := deleted+added
                       kind := file
                   modify rows (A/B/C, A/B/gamma),op_d=2:
                       op_d := 3

- Julian

> If a copied-here or moved-here occurs, then the root is updated with
> the appropriate status and source information. Child nodes *may* have
> their presence switched from 'deleted' to 'copied-here' or
> 'moved-here' (depends on whether the arriving nodes intersect with the
> old namespace). New nodes may be introduced, with presence=$whatever
> and prior_deleted=0 (FALSE)
> 
> If a deletion of a child (subtree) of copied-here or moved-here
> occurs, then it has a new op_depth and defines a whole new layer. The
> "prior_deleted" is set to 1 (TRUE) indicating the prior layer (which
> happens to be the copy/move-here) has been deleted.
> 
> Deletion of an add is effectively a revert. If this is a child, then
> the layer is simply removed (it only has one node). If the
> deletion/revert of an add has prior_deleted=1 (meaning it is a root),
> then the node is rewritten to presence='deleted', restoring it to the
> state when the deletion first occurred. (and yes, a second revert
> undoes the deletion, etc...).
> 
> Reverting a child of a moved/copied-here tree is invalid. When you
> revert the root, then the children at this op_depth are traversed: any
> nodes with prior_deleted=1 are restored to presence=deleted, and nodes
> with prior_deleted=0 (newly-arrived from the copy/move) are simply
> removed.
> 
> Note that prior_deleted is set to TRUE only for a deletion operation
> (when presence is set to 'deleted'). That implies a prior node
> existed. For the sequence [rm A/B, add A/B, add A/B/foo], the node
> A/B/foo will have op_depth=3 and prior_deleted=0 since the row was
> created by an add. Assuming that A/B/foo existed originally, then
> prior_deleted=1 at <A/B/foo, op_depth=2>.
> 
> 
> I think that is it. Summarized a bit better from the earlier thread.
> 
> Cheers,
> -g