You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@subversion.apache.org by Julian Foad <ju...@btopenworld.com> on 2012/09/13 22:18:01 UTC

Re: [RFC] Inheritable Properties Branch -- caching props in WC DB

Paul Burba wrote:

> [1] https://svn.apache.org/repos/asf/subversion/branches/inheritable-props

This branch caches properties of certain repository nodes, in a new place in the WC DB.

I haven't examined it at all yet.  The Wiki page says: "The cache will be stored in a new BLOB column in the NODES table called 'inherited_props'.  If the node is the base node of a WC root then this column is populated with a serialized skel of the node's inherited properties.  In all other cases it is NULL."

Another new special column in the NODES table... the twenty-third column.  Do we have to?  Is it not complex enough already?

I want to make a radical suggestion.

What would it take to be able to share some data instead of creating a special-purpose column just for this cache?  Well, it would take more work initially, but hear me out, as I have a hunch it may make less work in the end.

I think we would do well to factor out the storage of all data about a repository node-rev, moving that into a separate table that is *referenced* from the NODES table.  That new table would be indexed by node-rev (that is: repos_id, rev, 
relpath) and have a column for the properties and perhaps other columns 
for other attributes of the node-rev.

What we need to know here (for inherited props) is the properties set on each ancestor of ever WC "root node" (which is defined elsewhere and includes switched nodes etc.)  That's not a complex requirement.  Instead of creating a new column dedicated to storing the union of all the props inherited from every ancestor node, how about we fetch and concatenate, on request, the node-rev table rows for all the pathwise ancestors of the desired node.  The cache maintenance then becomes ensuring that a set of node-rev table rows are populated (populated in the standard way for such rows), and that those rows are removed when no longer referenced.

But this particular use is not the only reason for factoring out this data into a separate table.  There is:

  - For inherited props, this use.

  - For inherited props, if the current model of BASE node inheritance remains, we're also going to need to cache the inherited props for every mixed-rev node (that is, every node whose rev != its parent-in-BASE rev).  (I'll write separately about why we need this for off-line operation.)

  - We already have (I think) the case where a local WC-to-WC copy or move of a tree results in poulating another set of tree nodes in the NODES table with an identical copy of all the data.  Whenever we can reduce the amount of such copying, if the additional complexity is not too great, that can make it easier to achieve both correctness and efficiency.

  - For conflicts, it's useful to store the "theirs" and "old base" 
versions of the conflicted node.  Its kind, its properties, the reference to its (pristine) text, and perhaps other attributes.

  - I think it would make the WC *much* more maintainable.  At the moment, there doesn't appear to be any codified connection between those eight columns in NODES.  There doesn't even seem to be any comment in the doc strings to indicate that.  It's almost impossible to analyze the huge query statements to check that they're all updated together.

Therefore I propose a WC development that adds a new table (call it NODEREV?), indexed by repos/rev/relpath, with eight further columns that will be moved out of the current NODES table:

  repos_id, revision, repos_path -- the index columns
  kind
  properties
  checksum
  symlink_target
  changed_revision
  changed_date
  changed_author
  translated_size

Then we add an indirection from the NODES table via the index columns whenever we access those columns.  And some means of controlling creation and deletion of those rows, perhaps using a ref-count.

I acknowledge that such a development is far from trivial, and I don't expect Paul to do it as part of this work.  I just want to get the proposal out there and see if there is any consensus for working toward something like it.

In terms of the inheritable props branch, I wonder if there's any way we can move a bit closer to this proposal without going the whole hog, but I can't think of anything.

- Julian

Re: [RFC] Inheritable Properties Branch -- caching props in WC DB

Posted by Julian Foad <ju...@btopenworld.com>.

Bert: Sorry for getting emotionally worked up in my last response.  I know you're only expressing your concern about maintaining speed, and that's fine.

- Julian



----- Original Message -----
> From: Julian Foad <ju...@btopenworld.com>
> To: Bert Huijben <be...@qqmail.nl>
> Cc: 'Paul Burba' <pt...@gmail.com>; 'Subversion Development' <de...@subversion.apache.org>
> Sent: Thursday, 13 September 2012, 18:27
> Subject: Re: [RFC] Inheritable Properties Branch -- caching props in WC DB
> 
> Bert Huijben wrote:
> 
>>>   From: Julian Foad [mailto:julianfoad@btopenworld.com]
>>> 
>>>   This branch caches properties of certain repository nodes, in a new 
> place
>>>  in  the WC DB.
>>> 
>>>   I haven't examined it at all yet.  The Wiki page says: "The 
> cache will be
>>>  stored  in a new BLOB column in the NODES table called 
> 'inherited_props'.  
>>>  If the  node is the base node of a WC root then this column is 
> populated
>>>  with a  serialized skel of the node's inherited properties.  In all 
> other
>>>  cases  it is  NULL."
>>> 
>>>   Another new special column in the NODES table... the twenty-third
>>>   column.  Do we have to?  Is it not complex enough already?
>>> 
>>>   I want to make a radical suggestion.
>>> 
>>>   What would it take to be able to share some data instead of creating a
>>>   special-purpose column just for this cache?  Well, it would take more 
> work
>>>   initially, but hear me out, as I have a hunch it may make less work in 
> the
>>>  end.
>>> 
>>>   I think we would do well to factor out the storage of all data about a
>>>   repository node-rev, moving that into a separate table that is
>>>  *referenced*  from the NODES table.  That new table would be indexed by
>>>  node-rev (that  is: repos_id, rev,
>>>   relpath) and have a column for the properties and perhaps other 
> columns
>>>   for other attributes of the node-rev.
>> 
>>  That would be two additional table lookups per retrieval inside Sqlite3,
>>  [...] Have you investigated the performance costs of this proposal?
> 
> No.
> 
>>  I heard similar suggestions about properties before, but additional table
>>  lookups are certainly not free in Sqlite. They are not expensive by itself,
>>  but every extra db transaction is certainly a performance hit.
>> 
>>>   What we need to know here (for inherited props) is the properties set 
> on
>>>   each ancestor of ever WC "root node" (which is defined 
> elsewhere and
>>>   includes switched nodes etc.)  That's not a complex requirement.  
> Instead
>>>  of  creating a new column dedicated to storing the union of all the 
> props
>>>   inherited from every ancestor node, how about we fetch and 
> concatenate,
>>>   on request, the node-rev table rows for all the pathwise ancestors of 
> the
>>>   desired node.  The cache maintenance then becomes ensuring that a set 
> of
>>>   node-rev table rows are populated (populated in the standard way for 
> such
>>>   rows), and that those rows are removed when no longer referenced.
>>> 
>>>   But this particular use is not the only reason for factoring out this 
> data
>>>  into a  separate table.  There is:
>>> 
>>>     - For inherited props, this use.
>>> 
>>>     - For inherited props, if the current model of BASE node inheritance
>>>   remains, we're also going to need to cache the inherited props for 
> every
>>>   mixed-rev node (that is, every node whose rev != its parent-in-BASE
>>>   rev).  (I'll write separately about why we need this for off-line
>>>  operation.)
>> 
>>  I think this case was explicitly ignored.
>> 
>>  This would require that operations like commit start updating parts of the
>>  working copy with information that is not available locally.
> 
> Yes.  That's not a problem, is it?
> 
>>  Normally the commit shouldn't be able to introduce invalid inherited 
> state
>>  (or the commit would fail).
> 
> Pardon?  I don't understand what you're saying there.
> 
>>  And an explicit update of a subtree already adds the necessary information
>>  to the root (if the current state of the code still matches what I read
>>  before)
>> 
>>>     - We already have (I think) the case where a local WC-to-WC copy or 
> move
>>>   of a tree results in poulating another set of tree nodes in the NODES 
> table
>>>   with an identical copy of all the data.  Whenever we can reduce the 
> amount
>>>   of such copying, if the additional complexity is not too great, that 
> can
>>>  make it  easier to achieve both correctness and efficiency.
>> 
>>  When we copy from BASE to another layer we don't have to copy the 
> inherited
>>  properties. The inheritance in the new location will be based on the new
>>  url, not the original location. So the current queries already handle this
>>  case by not copying the data from the BASE layer.
> 
> I'm talking there about regular, explicit properties, not inherited 
> properties.
> 
> 
>>>     - For conflicts, it's useful to store the "theirs" and 
> "old base"
>>>   versions of the conflicted node.  Its kind, its properties, the 
> reference
>>>  to its  (pristine) text, and perhaps other attributes.
>> 
>>  On trunk everything is prepared to store conflicts in a skel, what has this
>>  to do with the inherited property storage?
> 
> It has nothing to do with inherited properties except that inherited properties 
> and conflicts could both refer to this table of cached node-revs.
> 
>>  In format 30 we already have the urls and properties of the conflicted 
> nodes
>>  (not the pristine text yet, but with the url we can retrieve that or we can
>>  implement storing the sha1 references in the separate columns as initially
>>  planned)
> 
> If we go this way, then I would suggest we migrate the conflict storage skel to 
> use (repos_id, rev, relpath) references to node-rev rows instead of storing the 
> props and other bits of metadata explicitly.
> 
> 
>>>     - I think it would make the WC *much* more maintainable.  At the 
> moment,
>>>   there doesn't appear to be any codified connection between those 
> eight
>>>   columns in NODES.  There doesn't even seem to be any comment in 
> the doc
>>>   strings to indicate that.  It's almost impossible to analyze the 
> huge
>>>  query  statements to check that they're all updated together.
>>> 
>>>   Therefore I propose a WC development that adds a new table (call it
>>>   NODEREV?), indexed by repos/rev/relpath, with eight further columns 
> that
>>>   will be moved out of the current NODES table:
>>> 
>>>     repos_id, revision, repos_path -- the index columns
>>>     kind
>>>     properties
>>>     checksum
>>>     symlink_target
>>>     changed_revision
>>>     changed_date
>>>     changed_author
>>>     translated_size
>> 
>>  (translated_size or really recorded_size shouldn't be in this list as 
> it can
>>  have a different value for the same file if it is checked out multiple
>>  times)
> 
> Oh, how can that be?  Oh, you mean the documentation is wrong?  Well, fancy 
> that.
> 
> 
>>>   Then we add an indirection from the NODES table via the index columns
>>>   whenever we access those columns.  And some means of controlling
>>>   creation and deletion of those rows, perhaps using a ref-count.
>> 
>>  Ok, so we add a ... guess ... guess ... 30% slowdown of our read queries 
> and
>>  even more for cascading updates to provider better maintainability, while a
>>  typical checkout working copy only contains nodes once?
>> 
>>  I did a lot of performance work to get at the overall 15% speedup we have
>>  since 1.7. Do you have any ideas on how much of that we will lose after 
> this
>>  suggested refactoring?
> 
> No, I have no idea.  Guess... guess... 0.1%?
> 
> Seriously, please let me put out a design refactoring proposal without beating 
> me down with the "slowness" argument.  I know speed is important, but 
> at the moment I'm concerned WC maintenance is soon going to die.  To me it 
> looks just as bad (in terms of undocumented complexity) as WC-1.0.  It's a 
> good effort, don't get me wrong -- I applaud the success of the WC rewrite 
> and the work you've put in to make it fast as well as correct.  That's 
> all great, but I occasionally look at the code and the (non-)documentation and 
> close the file again and leave it to someone else.
> 
> If the idea is rubbish because it has unfeasible performance, then fine, we 
> won't do it.  I'm sure you know much more than I do about WC DB 
> performance, so maybe you're right.
> 
> I *know* you and others are concerned about "performance" (I put it in 
> 
> quotes because it's referring to just one kind of performance: that is, 
> speed of performing a given operation).  Sure that kind of performance is 
> important, I don't deny it, and I don't want to needlessly reduce it 
> either.
> 
> But how about recognizing that there's some value to my suggestion and 
> seeking a way to keep the code-organization value without hitting the 
> performance penalty, instead of just hitting me with "doing that will add 
> some run-time cost" implying that's so wrong it doesn'y bear 
> thinking about.
> 
> Maybe we could make a tiny but  constructive start by adding a note in the 
> documentation that those columns all belong "together" in the sense of 
> being attributes of a node-rev, and maybe use some C preprocessor macros to 
> start handling them all together where it makes sense, instead of leaving 
> developers to try to remember which eight of the twenty-three arguments to a 
> SELECT statement or a 'read_info' call should all be null or all be 
> non-null together.
> 
> 
>>  I'm no fan of adding yet another layer in the working copy to just
>>  theoretically improve maintainability. That layer will also require
>>  maintenance, while it is for 99.999% a 1 on 1 relation. And in other cases
>>  we are slowly moving to a decentralized DVCS and maybe the storage should
>>  then also move to such a model.
> 
> Sorry, I'm not following.
> 
>>>   I acknowledge that such a development is far from trivial, and I 
> don't
>>>  expect  Paul to do it as part of this work.  I just want to get the
>>>  proposal out there  and see if there is any consensus for working 
> toward
>>>  something like it.
>>> 
>>>   In terms of the inheritable props branch, I wonder if there's any 
> way 
>>>  we can  move a bit closer to this proposal without going the whole hog,
>>>  but I can't  think of anything.
>> 
>>  I don't think this proposal, the inherited properties and/or the
>>  conflicts-skels proposals are directly related.
>> 
>> 
>>  And before we look at this or similar refactorings I think we should 
> measure
>>  the impact on some 'simple' case like 'svn status', that 
> would 
>>  have at least
>>  an additional lookup for every layer of every node for the foreign key
>>  handling.
>> 
>>      Bert
> 
> - Julian
>

Re: [RFC] Inheritable Properties Branch -- caching props in WC DB

Posted by Julian Foad <ju...@btopenworld.com>.

Bert Huijben wrote:

>>  From: Julian Foad [mailto:julianfoad@btopenworld.com]
>> 
>>  This branch caches properties of certain repository nodes, in a new place
>> in  the WC DB.
>> 
>>  I haven't examined it at all yet.  The Wiki page says: "The cache will be
>> stored  in a new BLOB column in the NODES table called 'inherited_props'.  
>> If the  node is the base node of a WC root then this column is populated
>> with a  serialized skel of the node's inherited properties.  In all other
>> cases  it is  NULL."
>> 
>>  Another new special column in the NODES table... the twenty-third
>>  column.  Do we have to?  Is it not complex enough already?
>> 
>>  I want to make a radical suggestion.
>> 
>>  What would it take to be able to share some data instead of creating a
>>  special-purpose column just for this cache?  Well, it would take more work
>>  initially, but hear me out, as I have a hunch it may make less work in the
>> end.
>> 
>>  I think we would do well to factor out the storage of all data about a
>>  repository node-rev, moving that into a separate table that is
>> *referenced*  from the NODES table.  That new table would be indexed by
>> node-rev (that  is: repos_id, rev,
>>  relpath) and have a column for the properties and perhaps other columns
>>  for other attributes of the node-rev.
> 
> That would be two additional table lookups per retrieval inside Sqlite3,
> [...] Have you investigated the performance costs of this proposal?

No.

> I heard similar suggestions about properties before, but additional table
> lookups are certainly not free in Sqlite. They are not expensive by itself,
> but every extra db transaction is certainly a performance hit.
> 
>>  What we need to know here (for inherited props) is the properties set on
>>  each ancestor of ever WC "root node" (which is defined elsewhere and
>>  includes switched nodes etc.)  That's not a complex requirement.  Instead
>> of  creating a new column dedicated to storing the union of all the props
>>  inherited from every ancestor node, how about we fetch and concatenate,
>>  on request, the node-rev table rows for all the pathwise ancestors of the
>>  desired node.  The cache maintenance then becomes ensuring that a set of
>>  node-rev table rows are populated (populated in the standard way for such
>>  rows), and that those rows are removed when no longer referenced.
>> 
>>  But this particular use is not the only reason for factoring out this data
>> into a  separate table.  There is:
>> 
>>    - For inherited props, this use.
>> 
>>    - For inherited props, if the current model of BASE node inheritance
>>  remains, we're also going to need to cache the inherited props for every
>>  mixed-rev node (that is, every node whose rev != its parent-in-BASE
>>  rev).  (I'll write separately about why we need this for off-line
>> operation.)
> 
> I think this case was explicitly ignored.
> 
> This would require that operations like commit start updating parts of the
> working copy with information that is not available locally.

Yes.  That's not a problem, is it?

> Normally the commit shouldn't be able to introduce invalid inherited state
> (or the commit would fail).

Pardon?  I don't understand what you're saying there.

> And an explicit update of a subtree already adds the necessary information
> to the root (if the current state of the code still matches what I read
> before)
> 
>>    - We already have (I think) the case where a local WC-to-WC copy or move
>>  of a tree results in poulating another set of tree nodes in the NODES table
>>  with an identical copy of all the data.  Whenever we can reduce the amount
>>  of such copying, if the additional complexity is not too great, that can
>> make it  easier to achieve both correctness and efficiency.
> 
> When we copy from BASE to another layer we don't have to copy the inherited
> properties. The inheritance in the new location will be based on the new
> url, not the original location. So the current queries already handle this
> case by not copying the data from the BASE layer.

I'm talking there about regular, explicit properties, not inherited properties.


>>    - For conflicts, it's useful to store the "theirs" and "old base"
>>  versions of the conflicted node.  Its kind, its properties, the reference
>> to its  (pristine) text, and perhaps other attributes.
> 
> On trunk everything is prepared to store conflicts in a skel, what has this
> to do with the inherited property storage?

It has nothing to do with inherited properties except that inherited properties and conflicts could both refer to this table of cached node-revs.

> In format 30 we already have the urls and properties of the conflicted nodes
> (not the pristine text yet, but with the url we can retrieve that or we can
> implement storing the sha1 references in the separate columns as initially
> planned)

If we go this way, then I would suggest we migrate the conflict storage skel to use (repos_id, rev, relpath) references to node-rev rows instead of storing the props and other bits of metadata explicitly.


>>    - I think it would make the WC *much* more maintainable.  At the moment,
>>  there doesn't appear to be any codified connection between those eight
>>  columns in NODES.  There doesn't even seem to be any comment in the doc
>>  strings to indicate that.  It's almost impossible to analyze the huge
>> query  statements to check that they're all updated together.
>> 
>>  Therefore I propose a WC development that adds a new table (call it
>>  NODEREV?), indexed by repos/rev/relpath, with eight further columns that
>>  will be moved out of the current NODES table:
>> 
>>    repos_id, revision, repos_path -- the index columns
>>    kind
>>    properties
>>    checksum
>>    symlink_target
>>    changed_revision
>>    changed_date
>>    changed_author
>>    translated_size
> 
> (translated_size or really recorded_size shouldn't be in this list as it can
> have a different value for the same file if it is checked out multiple
> times)

Oh, how can that be?  Oh, you mean the documentation is wrong?  Well, fancy that.


>>  Then we add an indirection from the NODES table via the index columns
>>  whenever we access those columns.  And some means of controlling
>>  creation and deletion of those rows, perhaps using a ref-count.
> 
> Ok, so we add a ... guess ... guess ... 30% slowdown of our read queries and
> even more for cascading updates to provider better maintainability, while a
> typical checkout working copy only contains nodes once?
> 
> I did a lot of performance work to get at the overall 15% speedup we have
> since 1.7. Do you have any ideas on how much of that we will lose after this
> suggested refactoring?

No, I have no idea.  Guess... guess... 0.1%?

Seriously, please let me put out a design refactoring proposal without beating me down with the "slowness" argument.  I know speed is important, but at the moment I'm concerned WC maintenance is soon going to die.  To me it looks just as bad (in terms of undocumented complexity) as WC-1.0.  It's a good effort, don't get me wrong -- I applaud the success of the WC rewrite and the work you've put in to make it fast as well as correct.  That's all great, but I occasionally look at the code and the (non-)documentation and close the file again and leave it to someone else.

If the idea is rubbish because it has unfeasible performance, then fine, we won't do it.  I'm sure you know much more than I do about WC DB performance, so maybe you're right.

I *know* you and others are concerned about "performance" (I put it in 
quotes because it's referring to just one kind of performance: that is, 
speed of performing a given operation).  Sure that kind of performance is important, I don't deny it, and I don't want to needlessly reduce it either.

But how about recognizing that there's some value to my suggestion and seeking a way to keep the code-organization value without hitting the performance penalty, instead of just hitting me with "doing that will add some run-time cost" implying that's so wrong it doesn'y bear thinking about.

Maybe we could make a tiny but  constructive start by adding a note in the documentation that those columns all belong "together" in the sense of being attributes of a node-rev, and maybe use some C preprocessor macros to start handling them all together where it makes sense, instead of leaving developers to try to remember which eight of the twenty-three arguments to a SELECT statement or a 'read_info' call should all be null or all be non-null together.


> I'm no fan of adding yet another layer in the working copy to just
> theoretically improve maintainability. That layer will also require
> maintenance, while it is for 99.999% a 1 on 1 relation. And in other cases
> we are slowly moving to a decentralized DVCS and maybe the storage should
> then also move to such a model.

Sorry, I'm not following.

>>  I acknowledge that such a development is far from trivial, and I don't
>> expect  Paul to do it as part of this work.  I just want to get the
>> proposal out there  and see if there is any consensus for working toward
>> something like it.
>> 
>>  In terms of the inheritable props branch, I wonder if there's any way 
>> we can  move a bit closer to this proposal without going the whole hog,
>> but I can't  think of anything.
> 
> I don't think this proposal, the inherited properties and/or the
> conflicts-skels proposals are directly related.
> 
> 
> And before we look at this or similar refactorings I think we should measure
> the impact on some 'simple' case like 'svn status', that would 
> have at least
> an additional lookup for every layer of every node for the foreign key
> handling.
> 
>     Bert

- Julian

RE: [RFC] Inheritable Properties Branch -- caching props in WC DB

Posted by Bert Huijben <be...@qqmail.nl>.


> -----Original Message-----
> From: Julian Foad [mailto:julianfoad@btopenworld.com]
> Sent: donderdag 13 september 2012 22:18
> To: Paul Burba
> Cc: Subversion Development
> Subject: Re: [RFC] Inheritable Properties Branch -- caching props in WC DB
> 
> Paul Burba wrote:
> 
> > [1] https://svn.apache.org/repos/asf/subversion/branches/inheritable-
> props
> 
> This branch caches properties of certain repository nodes, in a new place
in
> the WC DB.
> 
> I haven't examined it at all yet.  The Wiki page says: "The cache will be
stored
> in a new BLOB column in the NODES table called 'inherited_props'.  If the
> node is the base node of a WC root then this column is populated with a
> serialized skel of the node's inherited properties.  In all other cases it
is
> NULL."
> 
> Another new special column in the NODES table... the twenty-third
> column.  Do we have to?  Is it not complex enough already?
> 
> I want to make a radical suggestion.
> 
> What would it take to be able to share some data instead of creating a
> special-purpose column just for this cache?  Well, it would take more work
> initially, but hear me out, as I have a hunch it may make less work in the
end.
> 
> I think we would do well to factor out the storage of all data about a
> repository node-rev, moving that into a separate table that is
*referenced*
> from the NODES table.  That new table would be indexed by node-rev (that
> is: repos_id, rev,
> relpath) and have a column for the properties and perhaps other columns
> for other attributes of the node-rev.

That would be two additional table lookups per retrieval inside Sqlite3,
while an empty column costs negligible storage. A primary key that is not an
unique integer requires an additional lookup outside the extra lookup for
the foreign key handling.

Have you investigated the performance costs of this proposal?

I heard similar suggestions about properties before, but additional table
lookups are certainly not free in Sqlite. They are not expensive by itself,
but every extra db transaction is certainly a performance hit.

> 
> What we need to know here (for inherited props) is the properties set on
> each ancestor of ever WC "root node" (which is defined elsewhere and
> includes switched nodes etc.)  That's not a complex requirement.  Instead
of
> creating a new column dedicated to storing the union of all the props
> inherited from every ancestor node, how about we fetch and concatenate,
> on request, the node-rev table rows for all the pathwise ancestors of the
> desired node.  The cache maintenance then becomes ensuring that a set of
> node-rev table rows are populated (populated in the standard way for such
> rows), and that those rows are removed when no longer referenced.
> 
> But this particular use is not the only reason for factoring out this data
into a
> separate table.  There is:
> 
>   - For inherited props, this use.
> 
>   - For inherited props, if the current model of BASE node inheritance
> remains, we're also going to need to cache the inherited props for every
> mixed-rev node (that is, every node whose rev != its parent-in-BASE
> rev).  (I'll write separately about why we need this for off-line
operation.)

I think this case was explicitly ignored.

This would require that operations like commit start updating parts of the
working copy with information that is not available locally. Normally the
commit shouldn't be able to introduce invalid inherited state (or the commit
would fail). 

And an explicit update of a subtree already adds the necessary information
to the root (if the current state of the code still matches what I read
before)

>   - We already have (I think) the case where a local WC-to-WC copy or move
> of a tree results in poulating another set of tree nodes in the NODES
table
> with an identical copy of all the data.  Whenever we can reduce the amount
> of such copying, if the additional complexity is not too great, that can
make it
> easier to achieve both correctness and efficiency.

When we copy from BASE to another layer we don't have to copy the inherited
properties. The inheritance in the new location will be based on the new
url, not the original location. So the current queries already handle this
case by not copying the data from the BASE layer.

>   - For conflicts, it's useful to store the "theirs" and "old base"
> versions of the conflicted node.  Its kind, its properties, the reference
to its
> (pristine) text, and perhaps other attributes.

On trunk everything is prepared to store conflicts in a skel, what has this
to do with the inherited property storage?

In format 30 we already have the urls and properties of the conflicted nodes
(not the pristine text yet, but with the url we can retrieve that or we can
implement storing the sha1 references in the separate columns as initially
planned)

> 
>   - I think it would make the WC *much* more maintainable.  At the moment,
> there doesn't appear to be any codified connection between those eight
> columns in NODES.  There doesn't even seem to be any comment in the doc
> strings to indicate that.  It's almost impossible to analyze the huge
query
> statements to check that they're all updated together.
> 
> Therefore I propose a WC development that adds a new table (call it
> NODEREV?), indexed by repos/rev/relpath, with eight further columns that
> will be moved out of the current NODES table:
> 
>   repos_id, revision, repos_path -- the index columns
>   kind
>   properties
>   checksum
>   symlink_target
>   changed_revision
>   changed_date
>   changed_author
>   translated_size

(translated_size or really recorded_size shouldn't be in this list as it can
have a different value for the same file if it is checked out multiple
times)

> Then we add an indirection from the NODES table via the index columns
> whenever we access those columns.  And some means of controlling
> creation and deletion of those rows, perhaps using a ref-count.

Ok, so we add a ... guess ... guess ... 30% slowdown of our read queries and
even more for cascading updates to provider better maintainability, while a
typical checkout working copy only contains nodes once?

I did a lot of performance work to get at the overall 15% speedup we have
since 1.7. Do you have any ideas on how much of that we will lose after this
suggested refactoring?


I'm no fan of adding yet another layer in the working copy to just
theoretically improve maintainability. That layer will also require
maintenance, while it is for 99.999% a 1 on 1 relation. And in other cases
we are slowly moving to a decentralized DVCS and maybe the storage should
then also move to such a model.

> I acknowledge that such a development is far from trivial, and I don't
expect
> Paul to do it as part of this work.  I just want to get the proposal out
there
> and see if there is any consensus for working toward something like it.
> 
> In terms of the inheritable props branch, I wonder if there's any way we
can
> move a bit closer to this proposal without going the whole hog, but I
can't
> think of anything.

I don't think this proposal, the inherited properties and/or the
conflicts-skels proposals are directly related.


And before we look at this or similar refactorings I think we should measure
the impact on some 'simple' case like 'svn status', that would have at least
an additional lookup for every layer of every node for the foreign key
handling.

	Bert


> - Julian