Posted to dev@subversion.apache.org by Stefan Sperling <st...@elego.de> on 2011/08/26 13:14:32 UTC

FSFS successor ID design draft

Below is an initial draft of a design for successor-IDs in FSFS.
It is a bit long, but I hope it covers all relevant points in
sufficient detail.

Please share your thoughts on this. I have most likely missed a few things.

If we like it I will put it into the repository for further editing,
either in notes/ on trunk or onto the fsfs-successor ID branch.

If this design has some severe problem I don't mind going back to
the drawing board. This is my first shot, after all :)

cmpilato, does this significantly differ from what you've done for BDB?
Will both backends have the same (or similar) behaviour if we use
this design for FSFS?

Thanks to neels and danielsh for all your input so far.
You've been tremendously helpful!

I would also be happy to see alternative proposals.



FSFS successor IDs
==================

See http://svn.apache.org/repos/asf/subversion/branches/fs-successor-ids/BRANCH-README
for what this is all about.

What follows are ideas about implementing successor ID support for FSFS.

Useful background information:
  Node-revisions: subversion/libsvn_fs_base/notes/fs-history
                  subversion/libsvn_fs_base/notes/structure
  fsfs design: subversion/libsvn_fs_fs/structure


The "cache" and its properties
==============================

Storage of successor IDs is essentially a cache of the result of the
inversion of the predecessor relation between all node-revisions.
This inversion is a *very* expensive operation (follow every node
that ever existed in the repository from its most recent revision
back to the revision it was created in).

In the following, "cache" refers to the store for successor IDs.

After a general introduction, a proposed cache design is presented.
Note that the cache implementation will be hidden behind the API of
libsvn_fs_fs and can be improved or changed in every new minor
release of Subversion.

Required properties of successor ID cache for FSFS:
 - FSFS-compatible:
   - cache must use some plain file format
   - writers do not block readers (i.e. sqlite is out of the question)
   - cache must not rely on modifying existing revision files
 - cache can be regenerated on demand
 - do not corrupt the cache in case of an unexpected crash
 - try to keep number of directory entries low
 - conserve inodes by design (preferred), or via packing support

Nice-to-have properties:
 - the repository can be live during cache generation


Storage size requirements
=========================

Ballpark figures about node-revs from #svn-dev:
  * cmpilato does some math.  There are over 350,000 node revisions in my
      ancient copy of the svn.collab.net Subversion repos.
  <hwright> there are over 20M of them in the ASF repo

How many successors does a given node-rev have?

  There is at most one successor on the same line of history (same copy_id).
  Each copy operation also creates one new successor.

How big will the cache become, in total?

  The successor cache does not store entire node-revs, just node-rev IDs.
  These are generally not longer than 40 bytes (less than half a line
  in an 80x25 terminal being the rough estimate).

  The amount of raw information to store per node-revision is the
  size of this mapping:
    node_rev_id => successor_id

  Each entry is at most 80 bytes large: 40 bytes for the node_rev_id
  and 40 bytes for the successor ID.

  Each new node-revision adds one entry to the map.
  Since each successor is itself a node-revision, the entire cache
  has at most as many entries as the amount of node-revs in the
  repository.

  More precisely, the number of successors listed in the cache is
  equal to the number of node revisions in the repository, minus
  the number of nodes in HEAD (which don't presently have successors)
  and minus the number of nodes which were deleted in history (they
  don't have successors either).

  In other words, the number of successors in the cache is equal to
  the number of pred: entries in revision files.

  Let's try to plug in some numbers from the above svn.collab.net case.

  There are 350000 node revs. 
    maximum size of successor IDs in cache = 40 bytes * 350000
    maximum size of node-rev IDs in cache = 40 bytes * 350000
    maximum size of cache = 40 bytes * 350000 * 2

  This amounts to roughly 28MB worst-case successor cache size for
  the svn.collab.net repository (which itself is about 500MB in size).
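As a sanity check, the arithmetic above can be reproduced directly (a sketch; the 40-byte ID size and the 350,000 node-rev count are just the estimates quoted above):

```python
# Back-of-envelope check of the worst-case cache size estimated above.
NODE_REVS = 350_000   # node revisions in the svn.collab.net repository
ID_BYTES = 40         # generous upper bound for one node-rev ID

# One cache entry pairs a node_rev_id with one successor ID,
# and there is at most one entry per node revision.
entry_bytes = 2 * ID_BYTES
worst_case_bytes = NODE_REVS * entry_bytes

print(worst_case_bytes)            # 28000000
print(worst_case_bytes // 10**6)   # 28 (decimal megabytes)
```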


Implementation proposal
=======================


- On-disk File Format -

All files related to the cache are stored in the
repos/db/successor-id-cache directory ("cache directory").

There is one file which contains the number of the revision the
cache was last synced to: repos/db/successor-id-cache/cached-rev
This is used by both readers and writers.

The directory also contains files named after revisions,
and is sharded modulo some value TBD:

 successor-id-cache/0/1
 successor-id-cache/0/2
 successor-id-cache/0/3
 ...
 successor-id-cache/1/N
 successor-id-cache/1/M
 ...

These individual files store successors of all node-revs of a given revision
(or, optionally, a number of revisions -- see notes on packing below).

Each file starts with an index, followed by zero or more successors-blocks
("s-block" for short).

The format of the index is shown below. It contains the index size
(which is a fixed value) and one entry for each node_rev that appears
in the revision(s) the file is responsible for.

The index size and offsets are 64 bit values.
(32 bit index size would probably be sufficient, but let's not count peanuts.)
node_rev_id is a variable length ASCII string terminated by '\0'.

 +----------------------------------------------------------------------+
 |                           index size                                 |
 +----------------------------------------------------------------------+
 | node_rev_id | offset to my first s-block | offset to my last s-block |
 | node_rev_id | offset to my first s-block | offset to my last s-block |
 | node_rev_id | offset to my first s-block | offset to my last s-block |
 |                               ...                                    |    
 +----------------------------------------------------------------------+
 |               zero padding to multiple of s-block size               |
 +----------------------------------------------------------------------+

If a node-rev does not have any successors, both s-block offsets are zero.
This makes finding the answer to the question "does this node have any
successors?" an O(1) operation (only one disk block needs to be read).

If none of the node-revs have successors yet, the file ends here.

If a node-rev has successors, the file contains one or more s-blocks.
An s-block lists successors of the node-rev and provides some amount of
pre-allocated space for future successors. The offset to the first s-block
helps readers, the offset to the last s-block helps writers (the next
section provides details).

The padding at the end of the index ensures that all s-block boundaries
lie on disk sector boundaries (more details later).
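The index layout above could be serialized as in the following sketch (Python for illustration only; whether the 64-bit size field counts itself, and the example node-rev IDs, are assumptions rather than part of the draft):

```python
import struct

SBLOCK_SIZE = 8192  # assumed s-block size, see the discussion of SBLOCK_SIZE

def build_index(entries):
    """Serialize the index: a 64-bit index size, then one record per
    node-rev consisting of a '\\0'-terminated ASCII node_rev_id and two
    64-bit offsets (first and last s-block; both zero means the node-rev
    has no successors), zero-padded to a multiple of SBLOCK_SIZE so that
    every s-block starts on a block boundary.

    entries: list of (node_rev_id, first_offset, last_offset)."""
    body = b''.join(
        node_rev_id.encode('ascii') + b'\0' + struct.pack('>QQ', first, last)
        for node_rev_id, first, last in entries)
    # Assumption: the size field covers itself plus the records.
    raw = struct.pack('>Q', 8 + len(body)) + body
    return raw + b'\0' * ((-len(raw)) % SBLOCK_SIZE)

# Hypothetical node-rev IDs; the second one has one s-block at offset 8192.
index = build_index([('0.0.r1/17', 0, 0), ('0.1.r1/42', 8192, 8192)])
assert len(index) % SBLOCK_SIZE == 0
```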


The format of an s-block is shown below. It has a fixed size (SBLOCK_SIZE).
The offsets are 64 bit values. successor_id is a variable length ASCII
string terminated by '\0'.

 +----------------------------------------------------------------------+
 |                    offset of next s-block                            |
 +----------------------------------------------------------------------+
 |             offset of free space in this s-block                     |
 +----------------------------------------------------------------------+
 | successor_id, successor_id, ...                                      |
 +----------------------------------------------------------------------+
 |              free space zero-padded to SBLOCK_SIZE                   |
 +----------------------------------------------------------------------+

What is a good value for SBLOCK_SIZE?
Modern disks have a block size of 8192 bytes (some disks even use this
block size internally but report 512 bytes of block size to the OS).
Using this block size means that each s-block will fit into one sector
on modern disks, and into 16 sectors on older disks using 512 byte sectors.
In both cases an s-block can be read in a single sequential read.

Recall that a successor_id will take about 40 bytes. Let's round that
up to 64 for good measure. Applying advanced mathematics tells us that
8192 bytes will provide enough space for at least 128 successors of a
node-rev-id. This means we'll have to allocate a new s-block roughly every
128 successors (usually slightly more will fit).
If more successors must fit, SBLOCK_SIZE should be some multiple of 8192.

If this s-block is the last s-block for this node-rev, the offset of
the next s-block is zero.

The offset of free space within the s-block is used by readers when
reading existing successors and by writers when appending new successors.
(Note that in the implementation the offset of free space might also
be a 16 bit number relative to the start of the s-block, instead of
being a 64 bit number. But let's ignore such optimisations for now.) 



- Procedure for Readers -

A reader first reads the 'cached-rev' file and remembers the revision number.

It then locates the cache file corresponding to the given node_rev_id.
The name of the file is keyed on the revision of the node-rev. (Without
packing the file has the same name as the revision. With packing the
file is named after this revision modulo some fixed and known value.)

It opens this file and reads the index to learn the offset of the
first s-block of the node_rev_id.

If the offset is zero, the node-rev has no successors, and the search ends.

The reader seeks to the offset of the first s-block.

It reads the offset of free space within the s-block, and then reads
successors from the s-block up to this offset.

If the offset to the next s-block is zero, the search ends.
If the offset to the next s-block is non-zero, the reader
proceeds to the next s-block.

When the search ends, the reader returns the list of successors
and the revision number read from the 'cached-rev' file, which
indicates up to which revision the successor list is valid. (This is
a lower bound; the successor list might already contain some values
for newer revisions if a writer is concurrently updating the cache.
However, there is no harm done if the reader uses this information,
because it pertains to already committed revisions.)

Note that there are only O(number of s-blocks) disk seeks involved.
Any time data is read from disk one or more disk blocks are read sequentially.
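The reader's walk over the s-block chain can be sketched like this (Python for illustration; the 16-byte header layout and the example IDs are assumptions, and an in-memory file stands in for a cache file whose index is elided):

```python
import io
import struct

SBLOCK_SIZE = 8192

def read_successors(f, first_offset):
    """Follow the s-block chain starting at first_offset and collect all
    successor IDs, mirroring the reader steps above: one sequential read
    per s-block, stop when the next-s-block offset is zero."""
    ids = []
    offset = first_offset
    while offset:
        f.seek(offset)
        block = f.read(SBLOCK_SIZE)
        next_offset, free = struct.unpack_from('>QQ', block)
        ids += [s.decode('ascii') for s in block[16:free].split(b'\0') if s]
        offset = next_offset
    return ids

def sblock(ids, next_offset):
    """Helper: one fixed-size s-block (same layout as described above)."""
    body = b''.join(i.encode('ascii') + b'\0' for i in ids)
    raw = struct.pack('>QQ', next_offset, 16 + len(body)) + body
    return raw + b'\0' * (SBLOCK_SIZE - len(raw))

# A toy cache file: a placeholder index block, then two chained s-blocks
# at offsets 8192 and 16384 (a next-offset of 0 terminates the chain).
f = io.BytesIO(b'\0' * SBLOCK_SIZE
               + sblock(['1.0.r2/10'], 2 * SBLOCK_SIZE)
               + sblock(['1.5.r8/77'], 0))
assert read_successors(f, SBLOCK_SIZE) == ['1.0.r2/10', '1.5.r8/77']
```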



- Procedure for Writers -

Note that the following describes how successor data for a single
node_rev_id is updated. This is only a sub-step of the global cache
synchronisation procedure (which is described in the next section).

A writer operates under the assumption that the entire cache directory
has been locked for writing.

The writer locates the cache file corresponding to the given node_rev_id.

The writer opens this file and reads the index to learn the offset of the
last s-block of the node_rev_id.

The writer seeks to the offset of the last s-block.

The writer reads the offset of remaining free space within the s-block,
and calculates the amount of free space left within the s-block (recall
that the s-block has fixed size SBLOCK_SIZE).

The writer writes successor IDs to free space within the s-block,
and flushes the cache file to disk.

The writer updates the offset of free space in the s-block, and
flushes the cache file to disk.

(At this point readers get to see the new successor IDs.)


At any point, if remaining free space is too small for all new successors,
the writer must allocate a new s-block.

To do so, it seeks to the very end of the file and appends a new s-block
which initially contains only zeros.

The writer writes successors to free space in the new s-block.
It updates the offset to free space in the new s-block, and flushes
the cache file to disk.

It updates the 'next s-block' offset in the previous s-block to the
offset of the new s-block, and flushes the cache file to disk.


If the writer crashes, the following cases can happen:

  - New successors have been added to an s-block, but the offset to
    free space has not been updated.
    => Newly added successors will be overwritten by the next writer,
       and readers don't see them.
  - A new s-block has been written, but the 'next s-block' offset at
    the previous s-block has not been updated.
    => An unused s-block will remain in the file.
       The next writer adding successors of the same node-rev will
       append a new s-block to the file. Readers don't see the unused s-block.
       If the cache is rebuilt from scratch the unused s-block disappears.

Thus, cache consistency is guaranteed even in case of crashes.
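The crash-safety argument above rests on ordering the two writes: new successors land in free space first, and only the later free-space-offset update publishes them. A minimal simulation (Python; in-memory bytes stand in for the flushed file, and the IDs are hypothetical):

```python
import struct

SBLOCK_SIZE = 8192

def sblock(ids):
    """One s-block as described above: next offset, free-space offset,
    '\\0'-terminated successor IDs, zero padding."""
    body = b''.join(i.encode('ascii') + b'\0' for i in ids)
    raw = struct.pack('>QQ', 0, 16 + len(body)) + body
    return bytearray(raw + b'\0' * (SBLOCK_SIZE - len(raw)))

def visible_ids(block):
    """What a reader sees: successors up to the free-space offset only."""
    _, free = struct.unpack_from('>QQ', block)
    return [s.decode('ascii') for s in bytes(block[16:free]).split(b'\0') if s]

block = sblock(['1.0.r2/10'])

# Writer step 1: append the new successor into free space, flush.
new = b'1.3.r7/55\0'
_, free = struct.unpack_from('>QQ', block)
block[free:free + len(new)] = new

# A crash here leaves the free-space offset untouched: readers still see
# only the old successor, and the next writer overwrites the orphan bytes.
assert visible_ids(block) == ['1.0.r2/10']

# Writer step 2: updating the offset publishes the new successor.
struct.pack_into('>Q', block, 8, free + len(new))
assert visible_ids(block) == ['1.0.r2/10', '1.3.r7/55']
```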



- Cache Synchronisation -

The procedure for synchronising the cache is as follows.
Note that it assumes a write lock on the cache is held.

This procedure is run during post-commit processing after 'current'
has been updated. It can also be run manually via svnadmin.

sync_successor_id_cache() {
  open 'cached-rev' file
  read $cached_rev from 'cached-rev' file

  open 'current' file
  read $new_cached_rev from 'current' file
  close 'current' file

  # The new_successors hash table is keyed by a node_rev_id.
  # Its values are lists of successor IDs.
  #
  # The on-disk cache file format is designed for batch updates.
  # It is inefficient to add new successors to cache files one by one.
  # To update cache files efficiently, this table will grow up to a certain
  # upper bound and be flushed to disk periodically.
  new_successors = {}

  for rev in [$cached_rev + 1, ..., $new_cached_rev] {

    for node_rev in $rev {
      obtain predecessor p_node_rev of $node_rev
      obtain revision p_rev from $p_node_rev
      if cache file $p_rev exists {

        # this uses the 'reader' procedure as explained above
        look up $p_node_rev:$node_rev in cache file

        if $p_node_rev:$node_rev is already listed
          continue with next node_rev
        else
          new_successors[$p_node_rev] += $node_rev
      } else {
        new_successors[$p_node_rev] += $node_rev
      }
    }

    if space required by new_successors reached upper bound {
      for node_rev in new_successors.keys() {
        # update_cache_file() implements the 'writer' procedure explained above
        update_cache_file($node_rev, new_successors[$node_rev])
      }
      write $rev into 'cached-rev' file
      flush 'cached-rev' file
      new_successors = {}
    }
  }

  if new_successors != {} {
    for node_rev in new_successors.keys() {
      # update_cache_file() implements the 'writer' procedure explained above
      update_cache_file($node_rev, new_successors[$node_rev])
    }
    write $new_cached_rev into 'cached-rev' file
  }

  close 'cached-rev' file
}


If this code crashes, the following can happen:
  - The cache contains successors computed for revisions beyond the
    revision listed in the 'cached-rev' file.
    => The next writer will not add these successors again. No harm done.


- Packing -

The cache can optionally be packed. This means that successors of
node_revs from more than one revision live in the same cache file.

The on-disk format and the cache synchronisation procedure do not change.
The only thing that changes is the procedure to look up the name of the
cache file responsible for a given node-rev. The cache file is named
after the node-rev's revision modulo some fixed and known value.
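The file-name lookup might look like the following sketch (Python; the interpretation of "named after the revision modulo a fixed value" as flooring to the first revision of the pack, and the directory-sharding scheme, are assumptions since both values are still TBD):

```python
def cache_file_path(rev, pack_size=1, shard_size=1000):
    """Hypothetical cache-file lookup for a node-rev created in revision
    rev: the file is named after the first revision of its pack
    (pack_size 1 means no packing), inside a sharded directory."""
    name = rev - (rev % pack_size)
    return 'db/successor-id-cache/%d/%d' % (name // shard_size, name)

# Unpacked: one file per revision.
assert cache_file_path(7) == 'db/successor-id-cache/0/7'
# Packed by 100: revisions 2500..2599 share one file.
assert cache_file_path(2503, pack_size=100) == 'db/successor-id-cache/2/2500'
```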

To create the index of a packed cache file the combined number of
node-revs in all revisions that share the same cache file must be known.
Thus, a packed cache file can only be created after all revisions
which share the cache file have been committed.

Packing can be performed with 'svnadmin pack'.


- How does this design live up to our requirements? -

 - FSFS-compatible:
   - cache must use some plain file format
       => yes
   - writers do not block readers
       => yes, a reader will get consistent data even while the cache
          is being written to
   - cache must not rely on modifying existing revision files
       => yes
 - cache can be regenerated on demand
     => yes
 - do not corrupt the cache in case of an unexpected crash
     => the cache can only contain more information than claimed by
        the 'cached-rev' file
 - try to keep number of directory entries low
     => yes, via sharding
 - conserve inodes by design (preferred), or via packing support
     => yes, via optional packing

 - the repository can be live during cache generation
     => yes, but readers will not see all successors until the cache
        is fully populated

Re: FSFS successor ID design draft

Posted by Johan Corveleyn <jc...@gmail.com>.
On Fri, Aug 26, 2011 at 1:14 PM, Stefan Sperling <st...@elego.de> wrote:
> Below is an initial draft of a design for successor-IDs in FSFS.
> It is a bit long, but I hope it covers all relevant points in
> sufficient detail.

Nice! Interesting stuff :-).

I'm not an expert in these matters, but I've read it with interest,
for two reasons: (1) successor-IDs seem to be really useful (cf.
cmpilato's response) and (2) this design seems to be generic enough to
be able to create other caches (or indexes if you will) for FSFS in
the future, with the same basic techniques. Caches for other
information that's expensive to calculate, like <wild idea> line-based
diff-info for speeding up diff and blame -- something that was
discussed a bit during the Berlin Hackathon.</wild idea>

Anyway, I'm getting a bit ahead of myself, and don't want to hijack
this thread. But it seems like a nice standard procedure for building
such a cache, in a robust way, and where callers are able to use
whatever they can from the cache, and still do the expensive
calculation (with some upper-bound perhaps) on the part that wasn't
yet cached (and maybe insert that new information in the cache (?)).
Cool.

I had one remark:

[ ... ]
> - Procedure for Writers -
>
> Note that the following describes how successor data for a single
> node_rev_id is updated. This is only a sub-step of the global cache
> synchronisation procedure (which is described in the next section).
>
> A writer operates under the assumption that the entire cache directory
> has been locked for writing.
>
> The writer locates the cache file corresponding to the given node_rev_id.
>
> The writer opens this file and reads the index to learn the offset of the
> last s-block of the node_rev_id.

Somewhere here should be described what the writer does when this file
doesn't exist yet (create it, put an empty index in there, ... I don't
know).

-- 
Johan

Re: 'svnadmin upgrade' Re: FSFS successor ID design draft

Posted by Daniel Shahaf <da...@elego.de>.
That's the tool I had in mind, but to "lazily populate the cache" is
the tricky bit.

Mark Phippard wrote on Tue, Sep 06, 2011 at 11:50:40 -0400:
> If it makes any sense, you could look at what we did in 1.5 for merge tracking.
> 
> Merge tracking required a new bit of information -- the node origin
> index.  Dump/Load automatically populated this index.  svnadmin
> upgrade bumped the format but did not populate the index.  Instead the
> index was lazily built as needed and you suffered the performance
> penalty of doing it this way.  To offset this, we provided a separate
> tool svn-populate-node-origins-index which a user can run if they want
> to avoid the dump/load but still have predictable performance.  The
> tool runs a lot faster than a dump/load and basically lets you
> manually pre-populate the index so that users do not get the
> inconsistent performance they would get if it was being built on the
> fly at unpredictable moments.
> 
> Mark
> 
> 
> 
> On Tue, Sep 6, 2011 at 11:42 AM, Daniel Shahaf <da...@elego.de> wrote:
> > r1165700 mentions that we need to decide what to do with 'svnadmin
> > upgrade'.
> >
> > 1. We could punt and require a dump/load.  All format >=6 FSFS instances
> >   always have a full successors store.
> >
> > 2. We could store the progeny shard size and the successors data shard
> >   size in the 'format' file.  If those values are absent (or zero),
> >   then the successor lookup operation returns SVN_ERR_UNSUPPORTED_FEATURE.
> >
> >   We can also reconstruct the successors cache without a dump/load, via
> >   a tools/ binary that operates as follows:
> >
> >     for (i = 0; i < CONSTANT; i++)
> >       check 'current' and build the successors data up to that revision
> >     with write lock:
> >       check 'current' and build the successors data up to that revision
> >       updates 'format' with the shard sizes
> >     (release write lock)
> >
> > Requiring a dump/load is a barrier for _any_ upgrade from pre-1.8 to
> > post-1.8, so I prefer (2) to (1).  'svnadmin upgrade' might want to warn
> > about this.
> >
> > Stefan Sperling wrote on Thu, Sep 01, 2011 at 00:06:40 +0200:
> >> On Tue, Aug 30, 2011 at 12:43:29PM +0200, Stefan Sperling wrote:
> >> > I'll try to tweak my proposal such that successor ID updates become
> >> > transactional and happen as part of every commit.
> >>
> >> Here's a first shot at this. Comments welcome.
> >>
> >> To support transactional semantics, any new data written to the FSFS
> >> filesystem must be tied to the newly created revision.
> >> This way, updating 'current' as the final step in each commit makes the
> >> new data appear. Failing to update 'current' will cause any data added
> >> by failing commits to be overwritten by the next successful commit.
> >>
> >> The following design tries to keep this principle valid for successor-ids.
> >> There is one compromise though which hopefully won't hurt too much.
> >>
> >> Thanks again to danielsh for providing some initial feedback via IRC.
> >>
> >>
> >> Implementation proposal
> >> =======================
> >>
> >>
> >> - On-disk File Format -
> >>
> >> All files related to the map are stored in the repos/db/successors
> >> directory ("successors directory").
> >>
> >> The successor map contains 3 kinds of files:
> >>
> >>  1) Node-revision map files.
> >>     These files map node-revision IDs to lists of revisions.
> >>     This map answers the question:
> >>     "In which revisions were successors of a node-revision created?"
> >>
> >>  2) Created-revision map file.
> >>     These files map revision numbers to file offsets in successor data files.
> >>     This map answers the question:
> >>     "Where in the successor data are those successors which were created
> >>      in revision N?"
> >>
> >>  3) Successor data files.
> >>     These files store raw successor data.
> >>
> >> The different kinds of files refer to each other as shown below:
> >>
> >>                  node_rev_id map     revision map      successor data
> >>                    +-------+          +--------+       +-----------+
> >>                    |  ...  |          |  ...   |       | ...       |
> >>               +------>rN---------+    | offset |  +----->successor |
> >>               |    |  ...  |     |    | offset |  |    | successor |
> >>               |    |  ...  |     +----->offset----+    | END       |
> >> node_rev_id --+    |  rQ   |          |  ...   |       | successor |
> >>               |    |  ...  |          | offset |       | ...       |
> >>               +------>rX---------+    |  ...   |       | ...       |
> >>                    |  ...  |     |    |  ...   |       | END       |
> >>                    |  ...  |     +----->offset---------->successor |
> >>                    |  rZ   |          |  ...   |       | END       |
> >>                    +-------+          +--------+       +-----------+
> >>
> >> Each type of file is described below in detail.
> >>
> >>
> >> -- Node-revision-ID map files --
> >>
> >> The purpose of node-revision-id map files is to facilitate lookup
> >> of successors of a given node-revision.
> >>
> >> The files are named after the number of the revision the node-revision
> >> was created in, modulo some fixed to-be-determined value (i.e. there
> >> won't be one file per node revision, but one file for every 10, 100,
> >> or so, node-revisions).
> >>
> >> The files are stored within the "node-revs" sub-directory:
> >>
> >>  db/successors/node-revs/1
> >>  db/successors/node-revs/2
> >>  db/successors/node-revs/3
> >>  ...
> >>  db/successors/node-revs/N
> >>  db/successors/node-revs/M
> >>  ...
> >>
> >> Each node-revision map file stores a mapping of the form
> >>   node_rev_id => [revnum, revnum, ...]
> >> The revision numbers identify those revisions in which one or more
> >> successors of a given node-revision were added.
> >>
> >> In the first iteration of this design, the mapping is stored as lines
> >> of ASCII text. Each line contains an unparsed node_revision_id, a space
> >> separator (ASCII 0x20), and a revision number in ASCII decimal representation.
> >> (The design may be refined later to use a more efficient storage model.)
> >>
> >>
> >> -- Revision map files --
> >>
> >> The purpose of the revision map file is to facilitate lookup
> >> of successor data created in a given revision.
> >>
> >> The files are named after the numbers of revisions they are responsible for,
> >> modulo some fixed to-be-determined value (i.e. there won't be one file
> >> per revision, but one file for every 10, 100, or so, revisions; each file
> >> is responsible for some range of revisions).
> >>
> >> The files are stored within the "revs" sub-directory:
> >>
> >>  db/successors/revs/1
> >>  db/successors/revs/2
> >>  db/successors/revs/3
> >>  ...
> >>  db/successors/revs/N
> >>  db/successors/revs/M
> >>  ...
> >>
> >>
> >> Each file consists of an array of 64bit big-endian integers.
> >> The N'th integer (starting from N=1) in the file specifies the offset
> >> of successor data (in the corresponding successor data file) which was
> >> added in the N'th revision within the revision range the map file is
> >> responsible for.
> >>
> >>
> >> -- Successor-data files --
> >>
> >> These files use the same naming scheme as the revision map files (i.e.
> >> each successor data file is responsible for successors created within
> >> a specific range of revisions).
> >>
> >> The files are stored within the "successor-ids" sub-directory:
> >>
> >>  db/successors/successor-ids/1
> >>  db/successors/successor-ids/2
> >>  db/successors/successor-ids/3
> >>  ...
> >>  db/successors/successor-ids/N
> >>  db/successors/successor-ids/M
> >>  ...
> >>
> >>
> >> Each file consists of lines each containing two unparsed node_rev_id
> >> ASCII strings, separated by ASCII 0x20. The first node-revision ID is a
> >> predecessor, and the second one is one of its successors.
> >> The special string "END" on a single line signifies that the following
> >> successor data belongs to the next revision.
> >> (The design may be refined later to use a more efficient storage model.)
> >>
> >>
> >> - Procedure for Readers -
> >>
> >> A reader first reads the 'current' file and remembers the revision number.
> >>
> >> It locates the node-revision map file corresponding to the given node_rev_id,
> >> based on the node_rev_id's revision modulo a known fixed value.
> >>
> >> It opens this file and reads it to obtain a list of revisions which created
> >> successors of the given node_rev_id. It ignores lines listing node_rev_ids
> >> not matching the given node_rev_id. It also ignores revisions greater than
> >> 'current'.
> >>
> >> Next, the reader opens the corresponding revision map files,
> >> based on revision numbers in the list, each modulo a known fixed value.
> >> For each revision N in the revision list, it seeks to the revision's byte
> >> offset within a revision map file and reads 8 bytes to obtain the offset
> >> corresponding to revision N within a successor data file.
> >>
> >> The reader then opens the corresponding successor data files,
> >> based on revision numbers in the list, each modulo a known fixed value.
> >> For each revision offset, it seeks to this offset and reads lines until it
> >> finds the line "END". It loops through all lines and appends those
> >> successor IDs to the set of successors which list a predecessor matching
> >> the given node_rev_id.
> >>
> >> In exceptional cases (see below) it is possible that a given revision
> >> does not actually contain any successors of the given node_rev_id.
> >>
> >>
> >> - Procedure for Writers -
> >>
> >> Assume that the writer has access to a mapping from predecessor
> >> node-revision ID to one or more successor node-revision IDs.
> >> (The exact mechanism managing this mapping still needs to be determined.
> >> But it can be assumed to be available as all predecessors and successors
> >> are known within the context of a commit.)
> >>
> >> When committing a finished transaction (in fs_fs.c:commit_body()),
> >> the writer obtains a write-lock on all files within the successors
> >> directory it needs to modify.
> >>
> >> Let rN be the new revision the writer is committing.
> >>
> >> The writer creates an empty temporary file.
> >>
> >> If rN falls within the revision range of an existing successor data
> >> file, the writer looks up the offset of revision N-1 in the corresponding
> >> revision map file. It copies content from the corresponding successor data
> >> file to a temporary file up to this offset, and copies any data that follows
> >> until a line containing "END" has been copied. (Note that this step discards
> >> data left behind by any previous attempt at committing rN).
> >>
> >> The writer appends any new successor data to the temporary file (note
> >> that there may be no successor data in case the revision is empty).
> >> It then adds a terminating "END".
> >>
> >> The writer creates another empty temporary file.
> >>
> >> If rN falls within the revision range of an existing revision map file,
> >> it copies ((N-2)*8) bytes of content from the revision map file into the
> >> temporary file. (Note that this step discards data left behind by any
> >> previous attempt at committing rN).
> >>
> >> The writer appends, to the temporary file, the offset of the new data
> >> it wrote into the successor data file.
> >>
> >> Likewise, the writer creates empty temporary files for each node-revision
> >> map file it needs to modify. It copies all content from any such
> >> node-revision map files which already exist, and appends a line to each
> >> file containing the node_revision ID and the new revision number.
> >>
> >> After moving the proto-revprop file into place, and before updating
> >> 'current', the writer moves temporary files into place in the successors
> >> directory, in the following order:
> >>
> >>   1) the new successor data file
> >>   2) the new revision map file
> >>   3) each new node_rev_id map file
> >>
> >> If the writer fails to update all files in the successors directory,
> >> it will also fail to update 'current'. In this case, readers will ignore
> >> the new successor data until another commit succeeds in creating rN.
> >> Once a commit of rN has succeeded, readers following node-rev-id map
> >> entries updated by the failing commit might end up with an empty
> >> successors list for rN. Such erroneous entries will not be cleaned
> >> up automatically, but can be eliminated by re-creating the repository
> >> (e.g. via a dump/load cycle). However, the actual successor data stored
> >> for committed revisions is always correct, and the likelihood of incorrect
> >> node-revision ID map entries occurring is small.
> >
> 
> 
> 
> -- 
> Thanks
> 
> Mark Phippard
> http://markphip.blogspot.com/

Re: 'svnadmin upgrade' Re: FSFS successor ID design draft

Posted by Mark Phippard <ma...@gmail.com>.
If it makes any sense, you could look at what we did in 1.5 for merge tracking.

Merge tracking required a new bit of information -- the node origin
index.  Dump/Load automatically populated this index.  svnadmin
upgrade bumped the format but did not populate the index.  Instead the
index was lazily built as needed and you suffered the performance
penalty of doing it this way.  To offset this, we provided a separate
tool svn-populate-node-origins-index which a user can run if they want
to avoid the dump/load but still have predictable performance.  The
tool runs a lot faster than a dump/load and basically lets you
manually pre-populate the index so that users do not get the
inconsistent performance they would get if it was being built on the
fly at unpredictable moments.

Mark



On Tue, Sep 6, 2011 at 11:42 AM, Daniel Shahaf <da...@elego.de> wrote:
> r1165700 mentions that we need to decide what to do with 'svnadmin
> upgrade'.
>
> 1. We could punt and require a dump/load.  All format >=6 FSFS instances
>   always have a full successors store.
>
> 2. We could store the progeny shard size and the successors data shard
>   size in the 'format' file.  If those values are absent (or zero),
>   then the successor lookup operation returns SVN_ERR_UNSUPPORTED_FEATURE.
>
>   We can also reconstruct the successors cache without a dump/load, via
>   a tools/ binary that operates as follows:
>
>     for (i = 0; i < CONSTANT; i++)
>       check 'current' and build the successors data up to that revision
>     with write lock:
>       check 'current' and build the successors data up to that revision
>       updates 'format' with the shard sizes
>     (release write lock)
>
> Requiring a dump/load is a barrier for _any_ upgrade from pre-1.8 to
> post-1.8, so I prefer (2) to (1).  'svnadmin upgrade' might want to warn
> about this.
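[Editorial sketch: Daniel's two-phase rebuild outline above could look
roughly like the following. read_current(), build_successors_up_to(),
write_lock() and update_format_shard_sizes() are hypothetical stand-ins
for whatever the real tools/ binary would use, and CATCH_UP_PASSES
plays the role of CONSTANT.]

```python
# Two-phase rebuild: make a bounded number of lock-free catch-up passes
# (so most of the work happens without blocking writers), then take the
# write lock for one final, consistent pass and bump 'format'.
# All names on the fs object are illustrative, not real FSFS APIs.

CATCH_UP_PASSES = 4  # the CONSTANT above; the value here is arbitrary

def rebuild_successors(fs):
    for _ in range(CATCH_UP_PASSES):
        head = fs.read_current()           # lock-free pass
        fs.build_successors_up_to(head)
    with fs.write_lock():                  # final pass under the lock
        head = fs.read_current()
        fs.build_successors_up_to(head)
        fs.update_format_shard_sizes()     # only now advertise the cache
```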
>
> Stefan Sperling wrote on Thu, Sep 01, 2011 at 00:06:40 +0200:
>> On Tue, Aug 30, 2011 at 12:43:29PM +0200, Stefan Sperling wrote:
>> > I'll try to tweak my proposal such that successor ID updates become
>> > transactional and happen as part of every commit.
>>
>> Here's a first shot at this. Comments welcome.
>>
>> To support transactional semantics, any new data written to the FSFS
>> filesystem must be tied to the newly created revision.
>> This way, updating 'current' as the final step in each commit makes the
>> new data appear. Failing to update 'current' will cause any data added
>> by failing commits to be overwritten by the next successful commit.
>>
>> The following design tries to keep this principle valid for successor-ids.
>> There is one compromise though which hopefully won't hurt too much.
>>
>> Thanks again to danielsh for providing some initial feedback via IRC.
>>
>>
>> Implementation proposal
>> =======================
>>
>>
>> - On-disk File Format -
>>
>> All files related to the map are stored in the repos/db/successors
>> directory ("successors directory").
>>
>> The successor map contains 3 kinds of files:
>>
>>  1) Node-revision map files.
>>     These files map node-revision IDs to lists of revisions.
>>     This map answers the question:
>>     "In which revisions were successors of a node-revision created?"
>>
>>  2) Created-revision map files.
>>     These files map revision numbers to file offsets in successor data files.
>>     This map answers the question:
>>     "Where in the successor data are those successors which were created
>>      in revision N?"
>>
>>  3) Successor data files.
>>     These files store raw successor data.
>>
>> The different kinds of files refer to each other as shown below:
>>
>>                  node_rev_id map     revision map      successor data
>>                    +-------+          +--------+       +-----------+
>>                    |  ...  |          |  ...   |       | ...       |
>>               +------>rN---------+    | offset |  +----->successor |
>>               |    |  ...  |     |    | offset |  |    | successor |
>>               |    |  ...  |     +----->offset----+    | END       |
>> node_rev_id --+    |  rQ   |          |  ...   |       | successor |
>>               |    |  ...  |          | offset |       | ...       |
>>               +------>rX---------+    |  ...   |       | ...       |
>>                    |  ...  |     |    |  ...   |       | END       |
>>                    |  ...  |     +----->offset---------->successor |
>>                    |  rZ   |          |  ...   |       | END       |
>>                    +-------+          +--------+       +-----------+
>>
>> Each type of file is described below in detail.
>>
>>
>> -- Node-revision-ID map files --
>>
>> The purpose of node-revision-id map files is to facilitate lookup
>> of successors of a given node-revision.
>>
>> The files are named after the number of the revision the node-revision
>> was created in, modulo some fixed to-be-determined value (i.e. there
>> won't be one file per node revision, but one file for every 10, 100,
>> or so, node-revisions).
>>
>> The files are stored within the "node-revs" sub-directory:
>>
>>  db/successors/node-revs/1
>>  db/successors/node-revs/2
>>  db/successors/node-revs/3
>>  ...
>>  db/successors/node-revs/N
>>  db/successors/node-revs/M
>>  ...
>>
>> Each node-revision map file stores a mapping of the form
>>   node_rev_id => [revnum, revnum, ...]
>> The revision numbers identify those revisions in which one or more
>> successors of a given node-revision were added.
>>
>> In the first iteration of this design, the mapping is stored as lines
>> of ASCII text. Each line contains an unparsed node_revision_id, a space
>> separator (ASCII 0x20), and a revision number in ASCII decimal representation.
>> (The design may be refined later to use a more efficient storage model.)
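[Editorial sketch: the line format described above could be parsed as
in the following illustration. The function name and the sample
node-revision IDs are made up for the example.]

```python
# Hypothetical sketch of reading a node-revision map file: each line is
# "<node_rev_id> <revnum>" in ASCII, separated by a single space.

def parse_node_rev_map(text, node_rev_id):
    """Return the revisions in which successors of node_rev_id were added."""
    revisions = []
    for line in text.splitlines():
        if not line:
            continue
        # Node-rev IDs contain no spaces, so split on the last space.
        entry_id, _, revnum = line.rpartition(' ')
        if entry_id == node_rev_id:
            revisions.append(int(revnum))
    return revisions

# Made-up sample: two entries for one node-rev, one for another.
sample = "0.0.r1/17 2\n1.0.r1/42 3\n0.0.r1/17 5\n"
```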
>>
>>
>> -- Revision map files --
>>
>> The purpose of the revision map file is to facilitate lookup
>> of successor data created in a given revision.
>>
>> The files are named after the numbers of revisions they are responsible for,
>> modulo some fixed to-be-determined value (i.e. there won't be one file
>> per revision, but one file for every 10, 100, or so, revisions; each file
>> is responsible for some range of revisions).
>>
>> The files are stored within the "revs" sub-directory:
>>
>>  db/successors/revs/1
>>  db/successors/revs/2
>>  db/successors/revs/3
>>  ...
>>  db/successors/revs/N
>>  db/successors/revs/M
>>  ...
>>
>>
>> Each file consists of an array of 64bit big-endian integers.
>> The N'th integer (starting from N=1) in the file specifies the offset
>> of successor data (in the corresponding successor data file) which was
>> added in the N'th revision within the revision range the map file is
>> responsible for.
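[Editorial sketch: looking up a revision's offset in such a shard might
look like this. The shard size of 1000 is an arbitrary placeholder for
the to-be-determined value, and the function name is invented.]

```python
# Each shard is a flat array of 8-byte big-endian integers; the entry
# for a revision sits at its position within the shard's range.
import struct

SHARD = 1000  # assumed shard size; the real value is to-be-determined

def offset_for_revision(map_bytes, revnum):
    """Return revnum's offset into its shard's successor data file."""
    index = revnum % SHARD  # position of revnum within the shard's range
    (offset,) = struct.unpack_from('>Q', map_bytes, index * 8)
    return offset

# A map shard holding offsets 0, 128 and 4096 for its first three revisions:
shard = struct.pack('>QQQ', 0, 128, 4096)
```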
>>
>>
>> -- Successor-data files --
>>
>> These files use the same naming scheme as the revision map files (i.e.
>> each successor data file is responsible for successors created within
>> a specific range of revisions).
>>
>> The files are stored within the "successor-ids" sub-directory:
>>
>>  db/successors/successor-ids/1
>>  db/successors/successor-ids/2
>>  db/successors/successor-ids/3
>>  ...
>>  db/successors/successor-ids/N
>>  db/successors/successor-ids/M
>>  ...
>>
>>
>> Each file consists of lines each containing two unparsed node_rev_id
>> ASCII strings, separated by ASCII 0x20. The first node-revision ID is a
>> predecessor, and the second one is one of its successors.
>> The special string "END" on a single line signifies that the following
>> successor data belongs to the next revision.
>> (The design may be refined later to use a more efficient storage model.)
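[Editorial sketch: scanning one revision's block in a successor data
file, i.e. reading lines from a given offset until "END" and keeping
the pairs whose predecessor matches. Names and sample IDs are
illustrative only.]

```python
# Read "predecessor successor" lines from the given offset until the
# "END" sentinel, collecting successors of the requested node-rev.

def successors_at(data, offset, node_rev_id):
    """Collect successors of node_rev_id from the block starting at offset."""
    successors = []
    for line in data[offset:].decode('ascii').splitlines():
        if line == 'END':
            break  # end of this revision's successor data
        predecessor, successor = line.split(' ')
        if predecessor == node_rev_id:
            successors.append(successor)
    return successors

# Two revisions' worth of made-up successor data:
block = b'0.0.r1/17 0.0.r2/10\n1.0.r1/42 1.0.r2/55\nEND\n0.0.r2/10 0.0.r3/7\nEND\n'
```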
>>
>>
>> - Procedure for Readers -
>>
>> A reader first reads the 'current' file and remembers the revision number.
>>
>> It locates the node-revision map file corresponding to the given node_rev_id,
>> based on the node_rev_id's revision modulo a known fixed value.
>>
>> It opens this file and reads it to obtain a list of revisions which created
>> successors of the given node_rev_id. It ignores lines listing node_rev_ids
>> not matching the given node_rev_id. It also ignores revisions greater than
>> 'current'.
>>
>> Next, the reader opens the corresponding revision map files,
>> based on revision numbers in the list, each modulo a known fixed value.
>> For each revision N in the revision list, it seeks to the revision's byte
>> offset within a revision map file and reads 8 bytes to obtain the offset
>> corresponding to revision N within a successor data file.
>>
>> The reader then opens the corresponding successor data files,
>> based on revision numbers in the list, each modulo a known fixed value.
>> For each revision offset, it seeks to this offset and reads lines until it
>> finds the line "END". It loops through all lines and appends those
>> successor IDs to the set of successors which list a predecessor matching
>> the given node_rev_id.
>>
>> In exceptional cases (see below) it is possible that a given revision
>> does not actually contain any successors of the given node_rev_id.
>>
>>
>> - Procedure for Writers -
>>
>> Assume that the writer has access to a mapping from predecessor
>> node-revision ID to one or more successor node-revision IDs.
>> (The exact mechanism managing this mapping still needs to be determined.
>> But it can be assumed to be available as all predecessors and successors
>> are known within the context of a commit.)
>>
>> When committing a finished transaction (in fs_fs.c:commit_body()),
>> the writer obtains a write-lock on all files within the successors
>> directory it needs to modify.
>>
>> Let rN be the new revision the writer is committing.
>>
>> The writer creates an empty temporary file.
>>
>> If rN falls within the revision range of an existing successor data
>> file, the writer looks up the offset of revision N-1 in the corresponding
>> revision map file. It copies content from the corresponding successor data
>> file to a temporary file up to this offset, and copies any data that follows
>> until a line containing "END" has been copied. (Note that this step discards
>> data left behind by any previous attempt at committing rN).
>>
>> The writer appends any new successor data to the temporary file (note
>> that there may be no successor data in case the revision is empty).
>> It then adds a terminating "END".
>>
>> The writer creates another empty temporary file.
>>
>> If rN falls within the revision range of an existing revision map file,
>> it copies ((N-2)*8) bytes of content from the revision map file into the
>> temporary file. (Note that this step discards data left behind by any
>> previous attempt at committing rN).
>>
>> The writer appends, to the temporary file, the offset of the new data
>> it wrote into the successor data file.
>>
>> Likewise, the writer creates empty temporary files for each node-revision
>> map file it needs to modify. It copies all content from any such
>> node-revision map files which already exist, and appends a line to each
>> file containing the node_revision ID and the new revision number.
>>
>> After moving the proto-revprop file into place, and before updating
>> 'current', the writer moves temporary files into place in the successors
>> directory, in the following order:
>>
>>   1) the new successor data file
>>   2) the new revision map file
>>   3) each new node_rev_id map file
>>
>> If the writer fails to update all files in the successors directory,
>> it will also fail to update 'current'. In this case, readers will ignore
>> the new successor data until another commit succeeds in creating rN.
>> Once a commit of rN has succeeded, readers following node-rev-id map
>> entries updated by the failing commit might end up with an empty
>> successors list for rN. Such erroneous entries will not be cleaned
>> up automatically, but can be eliminated by re-creating the repository
>> (e.g. via a dump/load cycle). However, the actual successor data stored
>> for committed revisions is always correct, and the likelihood of incorrect
>> node-revision ID map entries occurring is small.
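[Editorial sketch: the writer's revision-map rebuild, i.e. keep only
the entries for revisions before rN and append rN's new offset, which
is what discards debris from earlier failed attempts at rN. Entries
are 8-byte big-endian integers per the format above; the function and
the sample shard are made up.]

```python
# Rebuild a revision map shard for the commit of rN: truncate to the
# entries preceding rN, then append the offset of rN's successor data.
import struct

def updated_rev_map(old_map, entries_before_rn, new_offset):
    """Return the new shard contents: entries up to rN-1, plus rN's."""
    keep = old_map[:entries_before_rn * 8]  # drop any debris at/after rN
    return keep + struct.pack('>Q', new_offset)

# rN is the shard's third revision; the old third entry is debris
# left behind by a previous failed attempt at committing rN.
old = struct.pack('>QQQ', 0, 100, 999)
new = updated_rev_map(old, 2, 250)
```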
>



-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: 'svnadmin upgrade' Re: FSFS successor ID design draft

Posted by Stefan Sperling <st...@elego.de>.
On Tue, Sep 06, 2011 at 06:42:44PM +0300, Daniel Shahaf wrote:
> r1165700 mentions that we need to decide what to do with 'svnadmin
> upgrade'.
> 
> 1. We could punt and require a dump/load.  All format >=6 FSFS instances
>    always have a full successors store.
> 
> 2. We could store the progeny shard size and the successors data shard
>    size in the 'format' file.  If those values are absent (or zero),
>    then the successor lookup operation returns SVN_ERR_UNSUPPORTED_FEATURE.
> 
>    We can also reconstruct the successors cache without a dump/load, via
>    a tools/ binary that operates as follows:
>    
>      for (i = 0; i < CONSTANT; i++)
>        check 'current' and build the successors data up to that revision
>      with write lock:
>        check 'current' and build the successors data up to that revision
>        updates 'format' with the shard sizes
>      (release write lock)
> 

3. Have 'svnadmin upgrade' populate successor data by harvesting
   existing revision files.

This will take a while. I don't know how long exactly.
As I mentioned in the commit message, if we do go this route we should
at least provide progress information (e.g. by outputting the number of
revisions processed so far).

Upgrade code will also need to be written for the BDB backend.

'svnadmin upgrade' Re: FSFS successor ID design draft

Posted by Daniel Shahaf <da...@elego.de>.
r1165700 mentions that we need to decide what to do with 'svnadmin
upgrade'.

1. We could punt and require a dump/load.  All format >=6 FSFS instances
   always have a full successors store.

2. We could store the progeny shard size and the successors data shard
   size in the 'format' file.  If those values are absent (or zero),
   then the successor lookup operation returns SVN_ERR_UNSUPPORTED_FEATURE.

   We can also reconstruct the successors cache without a dump/load, via
   a tools/ binary that operates as follows:
   
     for (i = 0; i < CONSTANT; i++)
       check 'current' and build the successors data up to that revision
     with write lock:
       check 'current' and build the successors data up to that revision
       updates 'format' with the shard sizes
     (release write lock)

Requiring a dump/load is a barrier for _any_ upgrade from pre-1.8 to
post-1.8, so I prefer (2) to (1).  'svnadmin upgrade' might want to warn
about this.

Stefan Sperling wrote on Thu, Sep 01, 2011 at 00:06:40 +0200:
> On Tue, Aug 30, 2011 at 12:43:29PM +0200, Stefan Sperling wrote:
> > I'll try to tweak my proposal such that successor ID updates become
> > transactional and happen as part of every commit.
> 
> Here's a first shot at this. Comments welcome.
> 
> To support transactional semantics, any new data written to the FSFS
> filesystem must be tied to the newly created revision.
> This way, updating 'current' as the final step in each commit makes the
> new data appear. Failing to update 'current' will cause any data added
> by failing commits to be overwritten by the next successful commit.
> 
> The following design tries to keep this principle valid for successor-ids.
> There is one compromise though which hopefully won't hurt too much.
> 
> Thanks again to danielsh for providing some initial feedback via IRC.
> 
> 
> Implementation proposal
> =======================
> 
> 
> - On-disk File Format -
> 
> All files related to the map are stored in the repos/db/successors
> directory ("successors directory").
> 
> The successor map contains 3 kinds of files:
> 
>  1) Node-revision map files.
>     These files map node-revision IDs to lists of revisions.
>     This map answers the question:
>     "In which revisions were successors of a node-revision created?"
> 
>  2) Created-revision map files.
>     These files map revision numbers to file offsets in successor data files.
>     This map answers the question:
>     "Where in the successor data are those successors which were created
>      in revision N?"
> 
>  3) Successor data files.
>     These files store raw successor data.
> 
> The different kinds of files refer to each other as shown below:
> 
>                  node_rev_id map     revision map      successor data
>                    +-------+          +--------+       +-----------+
>                    |  ...  |          |  ...   |       | ...       |
>               +------>rN---------+    | offset |  +----->successor | 
>               |    |  ...  |     |    | offset |  |    | successor |
>               |    |  ...  |     +----->offset----+    | END       |
> node_rev_id --+    |  rQ   |          |  ...   |       | successor |
>               |    |  ...  |          | offset |       | ...       |
>               +------>rX---------+    |  ...   |       | ...       |
>                    |  ...  |     |    |  ...   |       | END       |
>                    |  ...  |     +----->offset---------->successor |
>                    |  rZ   |          |  ...   |       | END       |
>                    +-------+          +--------+       +-----------+
> 
> Each type of file is described below in detail.
> 
> 
> -- Node-revision-ID map files --
> 
> The purpose of node-revision-id map files is to facilitate lookup
> of successors of a given node-revision.
> 
> The files are named after the number of the revision the node-revision
> was created in, modulo some fixed to-be-determined value (i.e. there
> won't be one file per node revision, but one file for every 10, 100,
> or so, node-revisions).
> 
> The files are stored within the "node-revs" sub-directory:
> 
>  db/successors/node-revs/1
>  db/successors/node-revs/2
>  db/successors/node-revs/3
>  ...
>  db/successors/node-revs/N
>  db/successors/node-revs/M
>  ...
> 
> Each node-revision map file stores a mapping of the form
>   node_rev_id => [revnum, revnum, ...]
> The revision numbers identify those revisions in which one or more
> successors of a given node-revision were added.
> 
> In the first iteration of this design, the mapping is stored as lines
> of ASCII text. Each line contains an unparsed node_revision_id, a space
> separator (ASCII 0x20), and a revision number in ASCII decimal representation.
> (The design may be refined later to use a more efficient storage model.)
> 
> 
> -- Revision map files --
> 
> The purpose of the revision map file is to facilitate lookup
> of successor data created in a given revision.
> 
> The files are named after the numbers of revisions they are responsible for,
> modulo some fixed to-be-determined value (i.e. there won't be one file
> per revision, but one file for every 10, 100, or so, revisions; each file
> is responsible for some range of revisions).
> 
> The files are stored within the "revs" sub-directory:
> 
>  db/successors/revs/1
>  db/successors/revs/2
>  db/successors/revs/3
>  ...
>  db/successors/revs/N
>  db/successors/revs/M
>  ...
> 
> 
> Each file consists of an array of 64bit big-endian integers.
> The N'th integer (starting from N=1) in the file specifies the offset
> of successor data (in the corresponding successor data file) which was
> added in the N'th revision within the revision range the map file is
> responsible for.
> 
> 
> -- Successor-data files --
> 
> These files use the same naming scheme as the revision map files (i.e.
> each successor data file is responsible for successors created within
> a specific range of revisions).
> 
> The files are stored within the "successor-ids" sub-directory:
> 
>  db/successors/successor-ids/1
>  db/successors/successor-ids/2
>  db/successors/successor-ids/3
>  ...
>  db/successors/successor-ids/N
>  db/successors/successor-ids/M
>  ...
> 
> 
> Each file consists of lines each containing two unparsed node_rev_id
> ASCII strings, separated by ASCII 0x20. The first node-revision ID is a
> predecessor, and the second one is one of its successors.
> The special string "END" on a single line signifies that the following
> successor data belongs to the next revision.
> (The design may be refined later to use a more efficient storage model.)
> 
> 
> - Procedure for Readers -
> 
> A reader first reads the 'current' file and remembers the revision number.
> 
> It locates the node-revision map file corresponding to the given node_rev_id,
> based on the node_rev_id's revision modulo a known fixed value.
> 
> It opens this file and reads it to obtain a list of revisions which created
> successors of the given node_rev_id. It ignores lines listing node_rev_ids
> not matching the given node_rev_id. It also ignores revisions greater than
> 'current'.
> 
> Next, the reader opens the corresponding revision map files,
> based on revision numbers in the list, each modulo a known fixed value.
> For each revision N in the revision list, it seeks to the revision's byte
> offset within a revision map file and reads 8 bytes to obtain the offset
> corresponding to revision N within a successor data file.
> 
> The reader then opens the corresponding successor data files,
> based on revision numbers in the list, each modulo a known fixed value.
> For each revision offset, it seeks to this offset and reads lines until it
> finds the line "END". It loops through all lines and appends those
> successor IDs to the set of successors which list a predecessor matching
> the given node_rev_id.
> 
> In exceptional cases (see below) it is possible that a given revision
> does not actually contain any successors of the given node_rev_id.
> 
> 
> - Procedure for Writers -
> 
> Assume that the writer has access to a mapping from predecessor
> node-revision ID to one or more successor node-revision IDs.
> (The exact mechanism managing this mapping still needs to be determined.
> But it can be assumed to be available as all predecessors and successors
> are known within the context of a commit.)
> 
> When committing a finished transaction (in fs_fs.c:commit_body()),
> the writer obtains a write-lock on all files within the successors
> directory it needs to modify.
> 
> Let rN be the new revision the writer is committing.
> 
> The writer creates an empty temporary file.
> 
> If rN falls within the revision range of an existing successor data
> file, the writer looks up the offset of revision N-1 in the corresponding
> revision map file. It copies content from the corresponding successor data
> file to a temporary file up to this offset, and copies any data that follows
> until a line containing "END" has been copied. (Note that this step discards
> data left behind by any previous attempt at committing rN).  
> 
> The writer appends any new successor data to the temporary file (note
> that there may be no successor data in case the revision is empty).
> It then adds a terminating "END".
> 
> The writer creates another empty temporary file.
> 
> If rN falls within the revision range of an existing revision map file,
> it copies ((N-2)*8) bytes of content from the revision map file into the
> temporary file. (Note that this step discards data left behind by any
> previous attempt at committing rN).
> 
> The writer appends, to the temporary file, the offset of the new data
> it wrote into the successor data file.
> 
> Likewise, the writer creates empty temporary files for each node-revision
> map file it needs to modify. It copies all content from any such
> node-revision map files which already exist, and appends a line to each
> file containing the node_revision ID and the new revision number.
> 
> After moving the proto-revprop file into place, and before updating
> 'current', the writer moves temporary files into place in the successors
> directory, in the following order:
> 
>   1) the new successor data file
>   2) the new revision map file
>   3) each new node_rev_id map file
> 
> If the writer fails to update all files in the successors directory,
> it will also fail to update 'current'. In this case, readers will ignore
> the new successor data until another commit succeeds in creating rN.
> Once a commit of rN has succeeded, readers following node-rev-id map
> entries updated by the failing commit might end up with an empty
> successors list for rN. Such erroneous entries will not be cleaned
> up automatically, but can be eliminated by re-creating the repository
> (e.g. via a dump/load cycle). However, the actual successor data stored
> for committed revisions is always correct, and the likelihood of incorrect
> node-revision ID map entries occurring is small.

Re: FSFS successor ID design draft

Posted by Daniel Shahaf <da...@elego.de>.
Daniel Shahaf wrote on Sun, Sep 04, 2011 at 02:23:28 +0300:
> some noderevs --- for example, the noderev for ^/subversion/trunk/BUGS
> --- are copied much more than others,

Bad example.

Re: FSFS successor ID design draft

Posted by Daniel Shahaf <da...@elego.de>.
Stefan Sperling wrote on Mon, Sep 05, 2011 at 12:52:10 +0200:
> On Mon, Sep 05, 2011 at 01:23:14PM +0300, Daniel Shahaf wrote:
> > Stefan Sperling wrote on Mon, Sep 05, 2011 at 11:38:11 +0200:
> > > So you're saying that we should run the plaintext proposed above
> > > through svndiff? Can you explain in more detail how this would work?
> > > What is the base of a delta?
> > > 
> > 
> > The file contains one or more DELTA\n..ENDREP\n streams:
> > 
> >   DELTA
> >   <svndiff stream>
> >   ENDREP
> >   DELTA
> >   <svndiff stream>
> >   ENDREP
> > 
> > (On second thought, we should be storing the length of the stream
> > somewhere; on the DELTA header seems a fine place:
> > 
> >   DELTA 512
> >   <512 bytes of svndiff stream>
> >   ENDREP
> >   DELTA 37
> >   <37 bytes of svndiff stream>
> >   ENDREP
> > 
> > .)  When the file is read, readers decode all the deltas and concatenate
> > the resulting plaintexts.  When the file is rewritten, writers
> > optionally combine the first N deltas into a single delta that produces
> > the combined plaintext.
> > 
> > The deltas can be self-compressed (like a DELTA\n rep in the revision
> > files), ie, having no base.
> 
> OK, I see. You're trying to save disk space, trading it for CPU time
> during read/write operations. Does that make sense? Is the amount of
> data really going to be big enough to be worth it?
> 

When you phrase it that way: I doubt it.

You suggested a 'more efficient storage model', so I remarked about one
idea that crossed my mind.  Obviously there are others, such as grouping
the storage by LHS of the mapping, rather than scattering map entries
with the same LHS all over the file.

> > > What is 'lhs'?
> > lhs = left-hand side
> > rhs = right-hand side
> 
> > How about calling them after the RHS'es of the mappings rather than
> > after the fact that they are mappings?
> > 
> > 
> > Currently:
> > 
> > - noderev map file, revision map file, successors data file
> > 
> > Perhaps:
> > 
> > - noderev posterity file, successor offsets file, successors data file
> 
> These names are fine with me.
> 
> What would you call them on disk?
> 

/successors/progeny/M
/successors/offsets/N
/successors/data/N

where M and N are the oldest revision number of their respective shards.
(So, for example, M = 0, 1000, 2000, 3000... )

Can probably improve on those names a bit...


Re: FSFS successor ID design draft

Posted by Stefan Sperling <st...@elego.de>.
On Mon, Sep 05, 2011 at 01:23:14PM +0300, Daniel Shahaf wrote:
> Stefan Sperling wrote on Mon, Sep 05, 2011 at 11:38:11 +0200:
> > So you're saying that we should run the plaintext proposed above
> > through svndiff? Can you explain in more detail how this would work?
> > What is the base of a delta?
> > 
> 
> The file contains one or more DELTA\n..ENDREP\n streams:
> 
>   DELTA
>   <svndiff stream>
>   ENDREP
>   DELTA
>   <svndiff stream>
>   ENDREP
> 
> (On second thought, we should be storing the length of the stream
> somewhere; on the DELTA header seems a fine place:
> 
>   DELTA 512
>   <512 bytes of svndiff stream>
>   ENDREP
>   DELTA 37
>   <37 bytes of svndiff stream>
>   ENDREP
> 
> .)  When the file is read, readers decode all the deltas and concatenate
> the resulting plaintexts.  When the file is rewritten, writers
> optionally combine the first N deltas into a single delta that produces
> the combined plaintext.
> 
> The deltas can be self-compressed (like a DELTA\n rep in the revision
> files), ie, having no base.

OK, I see. You're trying to save disk space, trading it for CPU time
during read/write operations. Does that make sense? Is the amount of
data really going to be big enough to be worth it?

> > What is 'lhs'?
> lhs = left-hand side
> rhs = right-hand side

> How about calling them after the RHS'es of the mappings rather than
> after the fact that they are mappings?
> 
> 
> Currently:
> 
> - noderev map file, revision map file, successors data file
> 
> Perhaps:
> 
> - noderev posterity file, successor offsets file, successors data file

These names are fine with me.

What would you call them on disk?

> (Is 'progeny' the more appropriate word here?

I like 'progeny' because it means 'immediate offspring'.
'Posterity' includes all descendants in all generations, and that's not
what the file is storing.

> > I am happy to just leave this debris in the files for now.
> > 
> > I would guess that nobody will ever even notice this problem in practice.
> > The number of commits failing within the time window where succesor data
> > is updated will statistically be very low to begin with.
> > Each time it happens we lose a very small fraction of disk space. We also
> > suffer a teeny tiny bit of read performance loss for readers of successors
> > of the affected node-revision. So what...
> > 
> > If it ever becomes a real problem, people can dump/load.
> > 
> 
> It'll work, but it's a kill a fly with a fleet approach :).  Dump/load
> the entire history (many MB of svndiffs) only to fix some derived
> noderev->offset map data?

Sure, it's not optimal. I doubt anyone will be bothered enough to
perform a dump/load just for this.

Re: FSFS successor ID design draft

Posted by Daniel Shahaf <da...@elego.de>.
Stefan Sperling wrote on Mon, Sep 05, 2011 at 11:38:11 +0200:
> On Sun, Sep 04, 2011 at 02:23:28AM +0300, Daniel Shahaf wrote:
> > > Each node-revision map file stores a mapping of the form
> > >   node_rev_id => [revnum, revnum, ...]
> > > The revision numbers identify those revisions in which one or more
> > > successors of a given node-revision were added.
> > > 
> > > In the first iteration of this design, the mapping is stored as lines
> > > of ASCII text. Each line contains an unparsed node_revision_id, a space
> > > separator (ASCII 0x20), and a revision number in ASCII decimal representation.
> > > (The design may be refined later to use a more efficient storage model.)
> > > 
> > 
> > Some wild ideas here:
> > 
> > - Multiple, concatenated DELTA\n..ENDREP\n svndiff streams.  That's
> >   appendable and compressible (a deltify process can combine adjacent
> >   streams, which will trigger zlib compression).
> > 
> >   (That won't work for the successors data since offsets into it are
> >   stored elsewhere.)
> 
> So you're saying that we should run the plaintext proposed above
> through svndiff? Can you explain in more detail how this would work?
> What is the base of a delta?
> 

The file contains one or more DELTA\n..ENDREP\n streams:

  DELTA
  <svndiff stream>
  ENDREP
  DELTA
  <svndiff stream>
  ENDREP

(On second thought, we should be storing the length of the stream
somewhere; on the DELTA header seems a fine place:

  DELTA 512
  <512 bytes of svndiff stream>
  ENDREP
  DELTA 37
  <37 bytes of svndiff stream>
  ENDREP

.)  When the file is read, readers decode all the deltas and concatenate
the resulting plaintexts.  When the file is rewritten, writers
optionally combine the first N deltas into a single delta that produces
the combined plaintext.

The deltas can be self-compressed (like a DELTA\n rep in the revision
files), ie, having no base.
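(To make the framing concrete: a rough Python sketch of a reader for such a
file. The 'DELTA <length>' header is as proposed just above; actual svndiff
decoding is out of scope here, so the sketch only splits out the raw stream
bodies.)

```python
def read_delta_frames(data):
    """Split a byte string of concatenated 'DELTA <len>\n...ENDREP\n'
    frames into the individual svndiff stream bodies.

    The framing is the one proposed in this thread, not an existing
    FSFS on-disk format."""
    frames = []
    pos = 0
    while pos < len(data):
        # Header line: 'DELTA <length-in-bytes>'
        header_end = data.index(b'\n', pos)
        kind, length = data[pos:header_end].decode('ascii').split(' ')
        assert kind == 'DELTA'
        body_start = header_end + 1
        body_end = body_start + int(length)
        frames.append(data[body_start:body_end])
        # Expect the 'ENDREP\n' trailer right after the body.
        assert data[body_end:body_end + 7] == b'ENDREP\n'
        pos = body_end + 7
    return frames
```

Because each frame carries its own length, a writer can append a new frame
without touching earlier ones, and a later deltify pass could replace the
first N frames with a single frame producing the combined plaintext.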

> > - Some packing support that also implements grouping of all the map
> >   entries with the same lhs?  Is that needed, or should the API at this
> 
> What is 'lhs'?
> 

lhs = left-hand side
rhs = right-hand side

> >   level only implement 'slurp the entire map into a hash mapping ints to
> >   arrays of revnums'?
> 
> I would say so, with the possible restriction that only a certain range
> of revisions is slurped into memory. This way, the API can optimise for
> callers which are only interested in a certain revision range (e.g.
> graphing tools can reconstruct the history of a file in several steps).
>  
> > (OT: those terms for the three kinds of files you have haven't sat well
> > with me; from the term alone I couldn't instantly recall what the
> > contents of the file were.)
> 
> What would you call them? I'm open to suggestions.
> 

How about calling them after the RHS'es of the mappings rather than
after the fact that they are mappings?


Currently:

- noderev map file, revision map file, successors data file

Perhaps:

- noderev posterity file, successor offsets file, successors data file

(Is 'progeny' the more appropriate word here?  I'm not sufficiently
familiar with its use, and there are no mentions of it as a simple noun
in the dev@ archives...)

> > > In exceptional cases (see below) it is possible that a given revision
> > > does not actually contain any successors of the given node_rev_id.
> > > 
> > 
> > Implementation question: would it make sense to have a mode for
> > inquiring about several noderev's successors at the same time, in order
> > to amortize the lookup operation (three open()s, several read()s of the
> > successor data file)'s cost across them?
> 
> Yes, sure.
> 
> The API could take a list of node-revision IDs and return a hash table
> keyed on node-revision ID, with the values being arrays of revision offsets.
> 

+1
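(Sketching that batch API in Python. The name and the two lookup callbacks
are hypothetical stand-ins for the map-file reads described in the draft;
the point is only how the per-revision reads get amortized:)

```python
def successors_for(node_rev_ids, lookup_revs, lookup_successors):
    """Batch lookup: return {node_rev_id: [successor, ...]}.

    'lookup_revs' maps a noderev to the revisions that created
    successors for it (the noderev map); 'lookup_successors' yields
    the (pred, succ) pairs recorded for one revision (the revision
    map plus successor data).  Both are stand-ins for file reads."""
    result = {}
    # Collect all revisions first, so each revision's successor data
    # block is read only once even if several of the requested
    # noderevs gained successors in it.
    wanted = {}
    for nr in node_rev_ids:
        result[nr] = []
        for rev in lookup_revs(nr):
            wanted.setdefault(rev, set()).add(nr)
    for rev, noderevs in sorted(wanted.items()):
        for pred, succ in lookup_successors(rev):
            if pred in noderevs:
                result[pred].append(succ)
    return result
```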

> > > When committing a finished transaction (in fs_fs.c:commit_body()),
> > > the writer obtains a write-lock on all files within the successors
> > > directory it needs to modify.
> > > 
> > 
> > Clarification: do you propose that a single write lock be obtained to
> > cover the entire successors/** hierarchy?
> > 
> > Also: is the FSFS write lock (which is held by commit_body() whilst it
> > runs) sufficient for this?
> 
> I would say the global write lock is sufficient.
> 
> I wasn't sure how exactly locking in FSFS works. You can read this
> statement as "the writer has some kind of write lock on everything it
> needs to change, possibly implied by a global write lock".
> 

Okay.

> > > If rN falls within the revision range of an existing revision map file,
> > > it copies ((N-2)*4) bytes of content from the revision map file into the
> > > temporary file. (Note that this step discards data left behind by any
> > > previous attempt at committing rN).
> > > 
> > 
> > ((N-2)*4) is bogus for N=1, and it's also missing a modulo operation
> > somewhere.
> 
> Math is hard!
> 
> > But the gist is clear: copy the four bytes of every
> > preceding revision in the shard.
> 
> Yup.
>  

:-)

> > > If the writer fails to update all files in the successors directory,
> > > it will also fail to update 'current'. In this case, readers will ignore
> > > the new successor data until another commit succeeds in creating rN.
> > > Once a commit of rN has succeeded, readers following node-rev-id map
> > > entries updated by the failing commit might end up with an empty
> > > successors list for rN. Such erroneous entries will not be cleaned
> > > up automatically, but can be eliminated by re-creating the repository
> > > (e.g. via a dump/load cycle).
> > 
> > Near the start you referred to a 'compromise'.  Is that the compromise?
> 
> This is what I meant, yes.
> 

Okay.

> > That said, one issue.  Debris in the successor data file or revision map
> > file will be overwritten as soon as an attempt to commit rN succeeds;
> > however, noderev map entries remain until a dump/load.  How feasible
> > would it be to also have a cleanup procedure for them?
> 
> I am happy to just leave this debris in the files for now.
> 
> I would guess that nobody will ever even notice this problem in practice.
> The number of commits failing within the time window where successor data
> is updated will statistically be very low to begin with.
> Each time it happens we lose a very small fraction of disk space. We also
> suffer a teeny tiny bit of read performance loss for readers of successors
> of the affected node-revision. So what...
> 
> If it ever becomes a real problem, people can dump/load.
> 

It'll work, but it's a 'kill a fly with a fleet' approach :).  Dump/load
the entire history (many MB of svndiffs) only to fix some derived
noderev->offset map data?

> One thing that is a little annoying is that file checksums for successor
> maps in production and backup repositories might not match up. Folks
> might then infer that there is some kind of repository corruption.
> So we should make 'svnadmin verify' print a warning if we find debris
> entries in the map. The warning would also explain that this is harmless
> and does not constitute corruption.
> 

Sounds fine.  (And another job for the svn_fs__verify() API!)

> > Such a cleanup procedure could grab the commits (and successors, if it's
> > separate) write locks, attempt to read the successors data for all
> > noderevs in a noderev map file, and delete the bogus entries from the
> > noderev map file.  (It could delete them either by atomically replacing
> > the file with a new version, or by replacing the first byte of the
> > revnum by some agreed non-[0-9\n] byte representing "This line of the
> > noderev map file shall be ignored".
> > 
> > (it must be a single byte because some revnums are represented in ASCII
> > by a single byte)
> 
> If you really want to we can always bolt something like this on later.

Agreed.

(Did I just propose another instance of "Let's bolt <another> thing on FSFS!"?)

Re: FSFS successor ID design draft

Posted by Stefan Sperling <st...@elego.de>.
On Sun, Sep 04, 2011 at 02:23:28AM +0300, Daniel Shahaf wrote:
> > Each node-revision map file stores a mapping of the form
> >   node_rev_id => [revnum, revnum, ...]
> > The revision numbers identify those revisions in which one or more
> > successors of a given node-revision were added.
> > 
> > In the first iteration of this design, the mapping is stored as lines
> > of ASCII text. Each line contains an unparsed node_revision_id, a space
> > separator (ASCII 0x20), and a revision number in ASCII decimal representation.
> > (The design may be refined later to use a more efficient storage model.)
> > 
> 
> Some wild ideas here:
> 
> - Multiple, concatenated DELTA\n..ENDREP\n svndiff streams.  That's
>   appendable and compressible (a deltify process can combine adjacent
>   streams, which will trigger zlib compression).
> 
>   (That won't work for the successors data since offsets into it are
>   stored elsewhere.)

So you're saying that we should run the plaintext proposed above
through svndiff? Can you explain in more detail how this would work?
What is the base of a delta?

> - Some packing support that also implements grouping of all the map
>   entries with the same lhs?  Is that needed, or should the API at this

What is 'lhs'?

>   level only implement 'slurp the entire map into a hash mapping ints to
>   arrays of revnums'?

I would say so, with the possible restriction that only a certain range
of revisions is slurped into memory. This way, the API can optimise for
callers which are only interested in a certain revision range (e.g.
graphing tools can reconstruct the history of a file in several steps).
 
> (OT: those terms for the three kinds of files you have haven't sat well
> with me; from the term alone I couldn't instantly recall what the
> contents of the file were.)

What would you call them? I'm open to suggestions.

> > In exceptional cases (see below) it is possible that a given revision
> > does not actually contain any successors of the given node_rev_id.
> > 
> 
> Implementation question: would it make sense to have a mode for
> inquiring about several noderev's successors at the same time, in order
> to amortize the lookup operation (three open()s, several read()s of the
> successor data file)'s cost across them?

Yes, sure.

The API could take a list of node-revision IDs and return a hash table
keyed on node-revision ID, with the values being arrays of revision offsets.

> > When committing a finished transaction (in fs_fs.c:commit_body()),
> > the writer obtains a write-lock on all files within the successors
> > directory it needs to modify.
> > 
> 
> Clarification: do you propose that a single write lock be obtained to
> cover the entire successors/** hierarchy?
> 
> Also: is the FSFS write lock (which is held by commit_body() whilst it
> runs) sufficient for this?

I would say the global write lock is sufficient.

I wasn't sure how exactly locking in FSFS works. You can read this
statement as "the writer has some kind of write lock on everything it
needs to change, possibly implied by a global write lock".

> > If rN falls within the revision range of an existing revision map file,
> > it copies ((N-2)*4) bytes of content from the revision map file into the
> > temporary file. (Note that this step discards data left behind by any
> > previous attempt at committing rN).
> > 
> 
> ((N-2)*4) is bogus for N=1, and it's also missing a modulo operation
> somewhere.

Math is hard!

> But the gist is clear: copy the four bytes of every
> preceding revision in the shard.

Yup.
 
> > If the writer fails to update all files in the successors directory,
> > it will also fail to update 'current'. In this case, readers will ignore
> > the new successor data until another commit succeeds in creating rN.
> > Once a commit of rN has succeeded, readers following node-rev-id map
> > entries updated by the failing commit might end up with an empty
> > successors list for rN. Such erroneous entries will not be cleaned
> > up automatically, but can be eliminated by re-creating the repository
> > (e.g. via a dump/load cycle).
> 
> Near the start you referred to a 'compromise'.  Is that the compromise?

This is what I meant, yes.

> That said, one issue.  Debris in the successor data file or revision map
> file will be overwritten as soon as an attempt to commit rN succeeds;
> however, noderev map entries remain until a dump/load.  How feasible
> would it be to also have a cleanup procedure for them?

I am happy to just leave this debris in the files for now.

I would guess that nobody will ever even notice this problem in practice.
The number of commits failing within the time window where successor data
is updated will statistically be very low to begin with.
Each time it happens we lose a very small fraction of disk space. We also
suffer a teeny tiny bit of read performance loss for readers of successors
of the affected node-revision. So what...

If it ever becomes a real problem, people can dump/load.

One thing that is a little annoying is that file checksums for successor
maps in production and backup repositories might not match up. Folks
might then infer that there is some kind of repository corruption.
So we should make 'svnadmin verify' print a warning if we find debris
entries in the map. The warning would also explain that this is harmless
and does not constitute corruption.

> Such a cleanup procedure could grab the commits (and successors, if it's
> separate) write locks, attempt to read the successors data for all
> noderevs in a noderev map file, and delete the bogus entries from the
> noderev map file.  (It could delete them either by atomically replacing
> the file with a new version, or by replacing the first byte of the
> revnum by some agreed non-[0-9\n] byte representing "This line of the
> noderev map file shall be ignored".
> 
> (it must be a single byte because some revnums are represented in ASCII
> by a single byte)

If you really want to we can always bolt something like this on later.

Re: FSFS successor ID design draft

Posted by Daniel Shahaf <da...@elego.de>.
Stefan Sperling wrote on Thu, Sep 01, 2011 at 00:06:40 +0200:
> On Tue, Aug 30, 2011 at 12:43:29PM +0200, Stefan Sperling wrote:
> > I'll try to tweak my proposal such that successor ID updates become
> > transactional and happen as part of every commit.
> 
> Here's a first shot at this. Comments welcome.
> 
> To support transactional semantics, any new data written to the FSFS
> filesystem must be tied to the newly created revision.
> This way, updating 'current' as the final step in each commit makes the
> new data appear. Failing to update 'current' will cause any data added
> by failing commits to be overwritten by the next successful commit.
> 
> The following design tries to keep this principle valid for successor-ids.

This indeed induces nice properties on the design.

Aside from the auto-overwrite properties you mention, it also lends
itself to 'old files are read-only' (to the degree that property is
possible to do for the successor-ids feature) and to making storage
requirements uniform: some noderevs --- for example, the noderev for
^/subversion/trunk/BUGS --- are copied much more than others, and this
design has the storage proportional to the revisions of the copy targets
rather than the copy sources.

> There is one compromise though which hopefully won't hurt too much.
> 
> Thanks again to danielsh for providing some initial feedback via IRC.
> 
> 
> Implementation proposal
> =======================
> 
> 
> - On-disk File Format -
> 
> All files related to the map are stored in the repos/db/successors
> directory ("successors directory").
> 
> The successor map contains 3 kinds of files:
> 
>  1) Node-revision map files.
>     These files map node-revision IDs to lists of revisions.
>     This map answers the question:
>     "In which revisions were successors of a node-revision created?"
> 
>  2) Created-revision map files.
>     These files map revision numbers to file offsets in successor data files.
>     This map answers the question:
>     "Where in the successor data are those successors which were created
>      in revision N?"
> 
>  3) Successor data files.
>     These files store raw successor data.
> 
> The different kinds of files refer to each other as shown below:
> 
>                  node_rev_id map     revision map      successor data
>                    +-------+          +--------+       +-----------+
>                    |  ...  |          |  ...   |       | ...       |
>               +------>rN---------+    | offset |  +----->successor | 
>               |    |  ...  |     |    | offset |  |    | successor |
>               |    |  ...  |     +----->offset----+    | END       |
> node_rev_id --+    |  rQ   |          |  ...   |       | successor |
>               |    |  ...  |          | offset |       | ...       |
>               +------>rX---------+    |  ...   |       | ...       |
>                    |  ...  |     |    |  ...   |       | END       |
>                    |  ...  |     +----->offset---------->successor |
>                    |  rZ   |          |  ...   |       | END       |
>                    +-------+          +--------+       +-----------+
> 
> Each type of file is described below in detail.
> 
> 
> -- Node-revision-ID map files --
> 
> The purpose of node-revision-id map files is to facilitate lookup
> of successors of a given node-revision.
> 
> The files are named after the number of the revision the node-revision
> was created in, modulo some fixed to-be-determined value (i.e. there
> won't be one file per node revision, but one file for every 10, 100,
> or so, node-revisions).
> 
> The files are stored within the "node-revs" sub-directory:
> 
>  db/successors/node-revs/1
>  db/successors/node-revs/2
>  db/successors/node-revs/3
>  ...
>  db/successors/node-revs/N
>  db/successors/node-revs/M
>  ...
> 
> Each node-revision map file stores a mapping of the form
>   node_rev_id => [revnum, revnum, ...]
> The revision numbers identify those revisions in which one or more
> successors of a given node-revision were added.
> 
> In the first iteration of this design, the mapping is stored as lines
> of ASCII text. Each line contains an unparsed node_revision_id, a space
> separator (ASCII 0x20), and a revision number in ASCII decimal representation.
> (The design may be refined later to use a more efficient storage model.)
> 
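(The line format just quoted is trivial to parse; a rough Python sketch,
assuming node-revision IDs contain no spaces:)

```python
def parse_noderev_map(text):
    """Parse node-revision map lines of the form
    '<node_rev_id> <revnum>' into {node_rev_id: [revnum, ...]}.

    Follows the ASCII format proposed above: one entry per line,
    fields separated by a single space (0x20)."""
    mapping = {}
    for line in text.splitlines():
        if not line:
            continue
        node_rev_id, _, revnum = line.rpartition(' ')
        mapping.setdefault(node_rev_id, []).append(int(revnum))
    return mapping
```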

Some wild ideas here:

- Multiple, concatenated DELTA\n..ENDREP\n svndiff streams.  That's
  appendable and compressible (a deltify process can combine adjacent
  streams, which will trigger zlib compression).

  (That won't work for the successors data since offsets into it are
  stored elsewhere.)

- Some packing support that also implements grouping of all the map
  entries with the same lhs?  Is that needed, or should the API at this
  level only implement 'slurp the entire map into a hash mapping ints to
  arrays of revnums'?

> 

(OT: those terms for the three kinds of files you have haven't sat well
with me; from the term alone I couldn't instantly recall what the
contents of the file were.)

> -- Revision map files --
> 
> The purpose of the revision map file is to facilitate lookup
> of successor data created in a given revision.
> 
> The files are named after the numbers of revisions they are responsible for,
> modulo some fixed to-be-determined value (i.e. there won't be one file
> per revision, but one file for every 10, 100, or so, revisions; each file
> is responsible for some range of revisions).
> 
> The files are stored within the "revs" sub-directory:
> 
>  db/successors/revs/1
>  db/successors/revs/2
>  db/successors/revs/3
>  ...
>  db/successors/revs/N
>  db/successors/revs/M
>  ...
> 
> 
> Each file consists of an array of 64bit big-endian integers.
> The N'th integer (starting from N=1) in the file specifies the offset
> of successor data (in the corresponding successor data file) which was
> added in the N'th revision within the revision range the map file is
> responsible for.
> 
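(A lookup in such a shard is one seek plus one fixed-size read; a Python
sketch, with the shard size a hypothetical placeholder for the 'fixed
to-be-determined value', and shards assumed to be aligned ranges
[1..SHARD_SIZE], [SHARD_SIZE+1..2*SHARD_SIZE], ...:)

```python
import struct

SHARD_SIZE = 100  # hypothetical "fixed to-be-determined value"

def offset_for_rev(revmap_bytes, rev):
    """Return the successor-data offset recorded for revision 'rev'
    in the revision map shard covering it.

    The map file holds one 64-bit big-endian integer per revision in
    the shard, in revision order, as proposed above."""
    index = (rev - 1) % SHARD_SIZE      # rN is the N'th entry, N >= 1
    entry = revmap_bytes[index * 8:index * 8 + 8]
    (offset,) = struct.unpack('>Q', entry)
    return offset
```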
> 
> -- Successor-data files --
> 
> These files use the same naming scheme as the revision map files (i.e.
> each successor data file is responsible for successors created within
> a specific range of revisions).
> 
> The files are stored within the "successor-ids" sub-directory:
> 
>  db/successors/successor-ids/1
>  db/successors/successor-ids/2
>  db/successors/successor-ids/3
>  ...
>  db/successors/successor-ids/N
>  db/successors/successor-ids/M
>  ...
> 
> 
> Each file consists of lines each containing two unparsed node_rev_id
> ASCII strings, separated by ASCII 0x20. The first node-revision ID is a
> predecessor, and the second one is one of its successors.
> The special string "END" on a single line signifies that the following
> successor data belongs to the next revision.
> (The design may be refined later to use a more efficient storage model.)
> 
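(A Python sketch of reading one revision's block from such a file, using
exactly the proposed 'predecessor successor' / 'END' line format:)

```python
def successors_in_block(lines):
    """Given the lines of a successor-data file starting at one
    revision's offset, return the (predecessor, successor) pairs
    recorded up to the terminating 'END' line."""
    pairs = []
    for line in lines:
        if line == 'END':
            break
        pred, succ = line.split(' ')
        pairs.append((pred, succ))
    return pairs
```

A reader would then keep only the pairs whose predecessor matches the
node_rev_id it is asking about.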
> 
> - Procedure for Readers -
> 
> A reader first reads the 'current' file and remembers the revision number.
> 
> It locates the node-revision map file corresponding to the given node_rev_id,
> based on the node_rev_id's revision modulo a known fixed value.
> 
> It opens this file and reads it to obtain a list of revisions which created
> successors of the given node_rev_id. It ignores lines listing node_rev_ids
> not matching the given node_rev_id. It also ignores revisions greater than
> 'current'.
> 
> Next, the reader opens the corresponding revision map files,
> based on revision numbers in the list, each modulo a known fixed value.
> For each revision N in the revision list, it seeks to the revision's byte
> offset within a revision map file and reads 4 bytes to obtain the offset
> corresponding to revision N within a successor data file.
> 
> The reader then opens the corresponding successor data files,

<Implementation note>
When multiple successors are available, there is a performance trade-off
here: what order to iterate the revision map files and successor data
files, when multiple revisions in multiple shards contain successors.
</Implementation note>

/me fondly recalls an exercise from BSc: "Compare the 3! possible
orderings of the elementary algorithm for N×N matrix multiplication"
(for a fixed C representation of matrices).

> based on revision numbers in the list, each modulo a known fixed value.
> For each revision offset, it seeks to this offset and reads lines until it
> finds the line "END". It loops through all lines and appends those
> successor IDs to the set of successors which list a predecessor matching
> the given node_rev_id.
(would be easier to parse if the words "to the set of successors" were relocated)
> 
> In exceptional cases (see below) it is possible that a given revision
> does not actually contain any successors of the given node_rev_id.
> 

Implementation question: would it make sense to have a mode for
inquiring about several noderev's successors at the same time, in order
to amortize the lookup operation (three open()s, several read()s of the
successor data file)'s cost across them?

> 
> - Procedure for Writers -
> 
> Assume that the writer has access to a mapping from predecessor
> node-revision ID to one or more successor node-revision IDs.
> (The exact mechanism managing this mapping still needs to be determined.
> But it can be assumed to be available as all predecessors and successors
> are known within the context of a commit.)
> 
> When committing a finished transaction (in fs_fs.c:commit_body()),
> the writer obtains a write-lock on all files within the successors
> directory it needs to modify.
> 

Clarification: do you propose that a single write lock be obtained to
cover the entire successors/** hierarchy?

Also: is the FSFS write lock (which is held by commit_body() whilst it
runs) sufficient for this?

> Let rN be the new revision the writer is committing.
> 
> The writer creates an empty temporary file.
> 
> If rN falls within the revision range of an existing successor data
> file, the writer looks up the offset of revision N-1 in the corresponding
> revision map file. It copies content from the corresponding successor data
> file to a temporary file up to this offset, and copies any data that follows
> until a line containing "END" has been copied. (Note that this step discards
> data left behind by any previous attempt at committing rN).  
> 
> The writer appends any new successor data to the temporary file (note
> that there may be no successor data in case the revision is empty).
> It then adds a terminating "END".
> 
> The writer creates another empty temporary file.
> 
> If rN falls within the revision range of an existing revision map file,
> it copies ((N-2)*4) bytes of content from the revision map file into the
> temporary file. (Note that this step discards data left behind by any
> previous attempt at committing rN).
> 

((N-2)*4) is bogus for N=1, and it's also missing a modulo operation
somewhere.  But the gist is clear: copy the four bytes of every
preceding revision in the shard.
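(Spelling the corrected arithmetic out in Python. The shard size is
a hypothetical placeholder, and the 4-byte entry size follows the math
quoted above even though the draft elsewhere specifies 64-bit entries;
only the indexing matters here:)

```python
SHARD_SIZE = 100   # hypothetical shard size
ENTRY_SIZE = 4     # per the quoted math; 64-bit entries would make this 8

def bytes_to_copy(n, shard_size=SHARD_SIZE, entry_size=ENTRY_SIZE):
    """How many bytes of an existing revision map shard to copy
    before appending rN's own entry: one entry per revision of the
    shard that precedes rN.  Shards assumed aligned, starting at r1."""
    first_rev_in_shard = ((n - 1) // shard_size) * shard_size + 1
    preceding = n - first_rev_in_shard
    return preceding * entry_size
```

Note this yields 0 both for N=1 and for the first revision of every
later shard, which is where the original ((N-2)*4) went wrong.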

(Which reminds me that r0's single noderev has successors too.)

> The writer appends, to the temporary file, the offset of the new data
> it wrote into the successor data file.
> 
> Likewise, the writer creates empty temporary files for each node-revision
> map file it needs to modify. It copies all content from any such
> node-revision map files which already exist, and appends a line to each
> file containing the node_revision ID and the new revision number.
> 

It may append multiple lines in case multiple noderevs are covered by
a single noderev map file.

> After moving the proto-revprop file into place, and before updating
> 'current', the writer moves temporary files into place in the successors
> directory, in the following order:
> 
>   1) the new successor data file
>   2) the new revision map file
>   3) each new node_rev_id map file
> 
> If the writer fails to update all files in the successors directory,
> it will also fail to update 'current'. In this case, readers will ignore
> the new successor data until another commit succeeds in creating rN.
> Once a commit of rN has succeeded, readers following node-rev-id map
> entries updated by the failing commit might end up with an empty
> successors list for rN. Such erroneous entries will not be cleaned
> up automatically, but can be eliminated by re-creating the repository
> (e.g. via a dump/load cycle).

Near the start you referred to a 'compromise'.  Is that the compromise?
Does 'compromise' refer to something else?

> However, the actual successor data stored
> for committed revisions is always correct, and the likelihood of incorrect
> node-revision ID map entries occurring is small.

Suppose committing rN failed.  That might have left debris in one of the
following forms:

- Appends to the noderev map files.

- Appends to the revision map file.

- Appends to the successor data file.

Revision map data is ignored until 'current' is >=N; therefore, if
a reader opens the map file at rN's offset, then an attempt to commit rN
has succeeded, which means readers will never see the successor data of
failed commit attempts.

/me satisfied.

> 

That said, one issue.  Debris in the successor data file or revision map
file will be overwritten as soon as an attempt to commit rN succeeds;
however, noderev map entries remain until a dump/load.  How feasible
would it be to also have a cleanup procedure for them?

Such a cleanup procedure could grab the commits (and successors, if it's
separate) write locks, attempt to read the successors data for all
noderevs in a noderev map file, and delete the bogus entries from the
noderev map file.  (It could delete them either by atomically replacing
the file with a new version, or by replacing the first byte of the
revnum by some agreed non-[0-9\n] byte representing "This line of the
noderev map file shall be ignored".

(it must be a single byte because some revnums are represented in ASCII
by a single byte)
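(A Python sketch of that tombstone trick; the marker byte '#' is just an
illustrative choice of agreed non-[0-9\n] byte. Overwriting a single byte
in place is what keeps the file length, and all other offsets, unchanged:)

```python
MARKER = b'#'

def tombstone_entry(mapfile_bytes, line_start):
    """Invalidate one noderev-map line by overwriting the first byte
    of its revnum field (the byte after the separating space) with
    the marker byte.  Returns the patched content; a real
    implementation would write that single byte in place instead."""
    space = mapfile_bytes.index(b' ', line_start)
    pos = space + 1
    return mapfile_bytes[:pos] + MARKER + mapfile_bytes[pos + 1:]

def parse_live_entries(mapfile_bytes):
    """Parse the map, skipping tombstoned lines."""
    entries = []
    for line in mapfile_bytes.decode('ascii').splitlines():
        node_rev_id, _, revnum = line.rpartition(' ')
        if revnum.startswith(MARKER.decode()):
            continue   # "This line shall be ignored"
        entries.append((node_rev_id, int(revnum)))
    return entries
```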

> 

+1


Re: question for FSFS gurus (was: Re: FSFS successor ID design draft)

Posted by Ivan Zhakov <iv...@visualsvn.com>.
On Mon, Sep 5, 2011 at 21:55, Stefan Sperling <st...@elego.de> wrote:
> On Mon, Sep 05, 2011 at 07:38:51PM +0200, Bert Huijben wrote:
>> > Also slightly OT (no FSFS-guruness here), but I think another
>> > important use-case is being able to quickly answer the question "in
>> > which revision was $URL@$REV deleted?" Or "give me the log of
>> > $URL@$REV up and until it was deleted."
>>
>> svn_ra_get_deleted_rev(), which answers this question, was introduced in 1.6.
>> (But I don't know where it is used.)
>
> The guts of this are in svn_repos_deleted_rev().
> It runs a binary search across a specific revision range to find the
> delete event. During this search it repeatedly scans history backwards
> to find a related node and has to employ a complicated set of rules to
> deal with copies and replacements.
>
> This could be simplified with successor support. Just scan successors
> upwards from the start revision until either the end revision is reached
> or the node disappears. The only complication are multiple copies of the
> node. If more than one copy event is found they will all need to be
> scanned upwards but only one will match the node we're looking for.
>
It looks like just a performance optimization. We can do it later, after
fixing tree-conflicts on moves.

-- 
Ivan Zhakov

Re: question for FSFS gurus (was: Re: FSFS successor ID design draft)

Posted by Stefan Sperling <st...@elego.de>.
On Mon, Sep 05, 2011 at 07:38:51PM +0200, Bert Huijben wrote:
> > Also slightly OT (no FSFS-guruness here), but I think another
> > important use-case is being able to quickly answer the question "in
> > which revision was $URL@$REV deleted?" Or "give me the log of
> > $URL@$REV up and until it was deleted."
> 
> svn_ra_get_deleted_rev(), which answers this question, was introduced in 1.6.
> (But I don't know where it is used.)

The guts of this are in svn_repos_deleted_rev().
It runs a binary search across a specific revision range to find the
delete event. During this search it repeatedly scans history backwards
to find a related node and has to employ a complicated set of rules to
deal with copies and replacements.

This could be simplified with successor support. Just scan successors
upwards from the start revision until either the end revision is reached
or the node disappears. The only complication are multiple copies of the
node. If more than one copy event is found they will all need to be
scanned upwards but only one will match the node we're looking for.
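(A toy Python sketch of that forward scan. The successor map here is a
hypothetical in-memory stand-in for the store described in this thread,
it records only same-node successors, and absence of a successor entry is
taken to mean the node was deleted in the next revision; real code would
also have to branch at copies, as noted above:)

```python
def find_deleted_rev(successors, node, start_rev, end_rev):
    """Walk successors forward from (node, start_rev); return the
    first revision in which the node no longer exists, or None if it
    survives through end_rev.

    'successors' maps (node, rev) -> list of (node, rev) successors.
    Copies would each need their own scan in a real implementation."""
    cur = (node, start_rev)
    while cur[1] < end_rev:
        # Keep only successors of the same node (ignore copies).
        nxt = [s for s in successors.get(cur, []) if s[0] == cur[0]]
        if not nxt:
            return cur[1] + 1   # node stops existing after cur[1]
        cur = min(nxt, key=lambda s: s[1])
    return None
```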

RE: question for FSFS gurus (was: Re: FSFS successor ID design draft)

Posted by Bert Huijben <be...@qqmail.nl>.

> -----Original Message-----
> From: Johan Corveleyn [mailto:jcorvel@gmail.com]
> Sent: maandag 5 september 2011 13:15
> To: Ivan Zhakov; dev@subversion.apache.org
> Subject: Re: question for FSFS gurus (was: Re: FSFS successor ID design
draft)
> 
> On Mon, Sep 05, 2011 at 01:46:09PM +0400, Ivan Zhakov wrote:
> [...]
> > I'm not an FSFS guru, but I still feel that FSFS successor ID isn't
> > worth implementing because there are no strong reasons/usages for
> > it. For me it looks like a bottom-up design approach.
> 
> Also slightly OT (no FSFS-guruness here), but I think another
> important use-case is being able to quickly answer the question "in
> which revision was $URL@$REV deleted?" Or "give me the log of
> $URL@$REV up and until it was deleted."

svn_ra_get_deleted_rev(), which answers this question, was introduced in 1.6.
(But I don't know where it is used.)

> This question comes up in practice once in a while (has been asked a
> couple of times on the users-list, and to me personally by some of my
> dev-colleagues). The workaround is usually to script around it for
> example by doing a "log -v" of an ancestor and looking for the first
> deletion after $REV (or something similar).
> 
> When it is suggested that this kind of search could also be
> implemented as a feature inside svn (even if it's slow, at least it
> would be part of the core), that request is always refused by saying
> that it wouldn't be performant enough (and consequently eat too much
> resources of an SVN server). I hope an FSFS successor ID storage could
> help in this regard.

	Bert
> 
> --
> Johan


Re: question for FSFS gurus (was: Re: FSFS successor ID design draft)

Posted by Johan Corveleyn <jc...@gmail.com>.
On Mon, Sep 05, 2011 at 01:46:09PM +0400, Ivan Zhakov wrote:
[...]
> I'm not an FSFS guru, but I still feel that FSFS successor ID isn't
> worth implementing because there are no strong reasons/usages for
> it. For me it looks like a bottom-up design approach.

Also slightly OT (no FSFS-guruness here), but I think another
important use-case is being able to quickly answer the question "in
which revision was $URL@$REV deleted?" Or "give me the log of
$URL@$REV up and until it was deleted."

This question comes up in practice once in a while (has been asked a
couple of times on the users-list, and to me personally by some of my
dev-colleagues). The workaround is usually to script around it for
example by doing a "log -v" of an ancestor and looking for the first
deletion after $REV (or something similar).

When it is suggested that this kind of search could also be
implemented as a feature inside svn (even if it's slow, at least it
would be part of the core), that request is always refused by saying
that it wouldn't be performant enough (and consequently eat too much
resources of an SVN server). I hope an FSFS successor ID storage could
help in this regard.

-- 
Johan

Re: question for FSFS gurus (was: Re: FSFS successor ID design draft)

Posted by Stefan Sperling <st...@elego.de>.
On Mon, Sep 05, 2011 at 12:12:31PM +0200, Stefan Sperling wrote:
> > Anyway we can implement top-level part of handling moves and then
> > optimize it using FSFS successor ID storage or something else.
> 
> We need the bottom parts first.
> It won't perform well enough without successor-IDs.

I should have been clearer:

"It" in my quote above is not the algorithm I posted pseudo-code for.
"It" is a different algorithm that finds nodes which were deleted in
the history of the target branch of a merge (again,
see http://svn.haxx.se/dev/archive-2011-08/0624.shtml).

Re: question for FSFS gurus (was: Re: FSFS successor ID design draft)

Posted by Stefan Sperling <st...@elego.de>.
On Mon, Sep 05, 2011 at 01:57:51PM +0400, Ivan Zhakov wrote:
> No, I think that auto-resolving tree-conflicts involving moves is most
> important task for Subversion 1.8. But I feel it could be implemented
> without implementing FSFS successor ID storage. It seems that
> algorithm that you posted could be reversed.

I have caused some confusion here, sorry.
Contrary to how I initially presented it, that algorithm has *nothing*
to do with successor IDs.

Please see this post where I corrected my mistake:
http://svn.haxx.se/dev/archive-2011-08/0624.shtml

> Anyway we can implement top-level part of handling moves and then
> optimize it using FSFS successor ID storage or something else.

We need the bottom parts first.
It won't perform well enough without successor-IDs.

Re: question for FSFS gurus (was: Re: FSFS successor ID design draft)

Posted by Ivan Zhakov <iv...@visualsvn.com>.
On Mon, Sep 5, 2011 at 14:43, Daniel Shahaf <da...@elego.de> wrote:
> Ivan Zhakov wrote on Mon, Sep 05, 2011 at 13:57:51 +0400:
>> On Mon, Sep 5, 2011 at 13:52, Stefan Sperling <st...@elego.de> wrote:
>> > On Mon, Sep 05, 2011 at 01:46:09PM +0400, Ivan Zhakov wrote:
>> >> On Mon, Sep 5, 2011 at 13:41, Stefan Sperling <st...@elego.de> wrote:
>> >> > On Thu, Sep 01, 2011 at 12:06:40AM +0200, Stefan Sperling wrote:
>> >> >> On Tue, Aug 30, 2011 at 12:43:29PM +0200, Stefan Sperling wrote:
>> >> >> > I'll try to tweak my proposal such that successor ID updates become
>> >> >> > transactional and happen as part of every commit.
>> >> >>
>> >> >> Here's a first shot at this. Comments welcome.
>> >> >
>> >> > FSFS gurus:
>> >> >
>> >> > Are any of you looking at this?
>> >> > Do you think this is worth writing a prototype implementation for?
>> >> >
>> >> > I have so far only received feedback from danielsh. This makes me very
>> >> > happy but if anyone with a couple more years of FSFS experience under
>> >> > their belt could comment I would be even happier.
>> >> >
>> >> I'm not FSFS guru, but I still feel that FSFS successor ID doesn't
>> >> worth to be implemented because there is no strong reasons/usage for
>> >> it. For me it looks like bottom-up design approach.
>> >
>> > Hmmm... you don't think that auto-resolving tree-conflicts involving
>> > moves during merges is worth implementing?
>> >
>> No, I think that auto-resolving tree-conflicts involving moves is most
>> important task for Subversion 1.8. But I feel it could be implemented
>> without implementing FSFS successor ID storage. It seems that
>> algorithm that you posted could be reversed.
>>
>
> What do you mean by 'reversed'?
>
I mean to iterate over added_nodes (instead of deleted_nodes).

-- 
Ivan Zhakov

Re: question for FSFS gurus (was: Re: FSFS successor ID design draft)

Posted by Daniel Shahaf <da...@elego.de>.
Ivan Zhakov wrote on Mon, Sep 05, 2011 at 13:57:51 +0400:
> On Mon, Sep 5, 2011 at 13:52, Stefan Sperling <st...@elego.de> wrote:
> > On Mon, Sep 05, 2011 at 01:46:09PM +0400, Ivan Zhakov wrote:
> >> On Mon, Sep 5, 2011 at 13:41, Stefan Sperling <st...@elego.de> wrote:
> >> > On Thu, Sep 01, 2011 at 12:06:40AM +0200, Stefan Sperling wrote:
> >> >> On Tue, Aug 30, 2011 at 12:43:29PM +0200, Stefan Sperling wrote:
> >> >> > I'll try to tweak my proposal such that successor ID updates become
> >> >> > transactional and happen as part of every commit.
> >> >>
> >> >> Here's a first shot at this. Comments welcome.
> >> >
> >> > FSFS gurus:
> >> >
> >> > Are any of you looking at this?
> >> > Do you think this is worth writing a prototype implementation for?
> >> >
> >> > I have so far only received feedback from danielsh. This makes me very
> >> > happy but if anyone with a couple more years of FSFS experience under
> >> > their belt could comment I would be even happier.
> >> >
> >> I'm not FSFS guru, but I still feel that FSFS successor ID doesn't
> >> worth to be implemented because there is no strong reasons/usage for
> >> it. For me it looks like bottom-up design approach.
> >
> > Hmmm... you don't think that auto-resolving tree-conflicts involving
> > moves during merges is worth implementing?
> >
> No, I think that auto-resolving tree-conflicts involving moves is most
> important task for Subversion 1.8. But I feel it could be implemented
> without implementing FSFS successor ID storage. It seems that
> algorithm that you posted could be reversed.
> 

What do you mean by 'reversed'?

> Anyway we can implement top-level part of handling moves and then
> optimize it using FSFS successor ID storage or something else.
> 
> 
> -- 
> Ivan Zhakov

Re: question for FSFS gurus (was: Re: FSFS successor ID design draft)

Posted by Ivan Zhakov <iv...@visualsvn.com>.
On Mon, Sep 5, 2011 at 13:52, Stefan Sperling <st...@elego.de> wrote:
> On Mon, Sep 05, 2011 at 01:46:09PM +0400, Ivan Zhakov wrote:
>> On Mon, Sep 5, 2011 at 13:41, Stefan Sperling <st...@elego.de> wrote:
>> > On Thu, Sep 01, 2011 at 12:06:40AM +0200, Stefan Sperling wrote:
>> >> On Tue, Aug 30, 2011 at 12:43:29PM +0200, Stefan Sperling wrote:
>> >> > I'll try to tweak my proposal such that successor ID updates become
>> >> > transactional and happen as part of every commit.
>> >>
>> >> Here's a first shot at this. Comments welcome.
>> >
>> > FSFS gurus:
>> >
>> > Are any of you looking at this?
>> > Do you think this is worth writing a prototype implementation for?
>> >
>> > I have so far only received feedback from danielsh. This makes me very
>> > happy but if anyone with a couple more years of FSFS experience under
>> > their belt could comment I would be even happier.
>> >
>> I'm not FSFS guru, but I still feel that FSFS successor ID doesn't
>> worth to be implemented because there is no strong reasons/usage for
>> it. For me it looks like bottom-up design approach.
>
> Hmmm... you don't think that auto-resolving tree-conflicts involving
> moves during merges is worth implementing?
>
No, I think that auto-resolving tree conflicts involving moves is the most
important task for Subversion 1.8. But I feel it could be implemented
without implementing FSFS successor ID storage. It seems that the
algorithm you posted could be reversed.

Anyway, we can implement the top-level part of handling moves first and then
optimize it using FSFS successor ID storage or something else.


-- 
Ivan Zhakov

Re: question for FSFS gurus (was: Re: FSFS successor ID design draft)

Posted by Stefan Sperling <st...@elego.de>.
On Mon, Sep 05, 2011 at 01:46:09PM +0400, Ivan Zhakov wrote:
> On Mon, Sep 5, 2011 at 13:41, Stefan Sperling <st...@elego.de> wrote:
> > On Thu, Sep 01, 2011 at 12:06:40AM +0200, Stefan Sperling wrote:
> >> On Tue, Aug 30, 2011 at 12:43:29PM +0200, Stefan Sperling wrote:
> >> > I'll try to tweak my proposal such that successor ID updates become
> >> > transactional and happen as part of every commit.
> >>
> >> Here's a first shot at this. Comments welcome.
> >
> > FSFS gurus:
> >
> > Are any of you looking at this?
> > Do you think this is worth writing a prototype implementation for?
> >
> > I have so far only received feedback from danielsh. This makes me very
> > happy but if anyone with a couple more years of FSFS experience under
> > their belt could comment I would be even happier.
> >
> I'm not FSFS guru, but I still feel that FSFS successor ID doesn't
> worth to be implemented because there is no strong reasons/usage for
> it. For me it looks like bottom-up design approach.

Hmmm... you don't think that auto-resolving tree-conflicts involving
moves during merges is worth implementing?

Re: question for FSFS gurus (was: Re: FSFS successor ID design draft)

Posted by Ivan Zhakov <iv...@visualsvn.com>.
On Mon, Sep 5, 2011 at 13:41, Stefan Sperling <st...@elego.de> wrote:
> On Thu, Sep 01, 2011 at 12:06:40AM +0200, Stefan Sperling wrote:
>> On Tue, Aug 30, 2011 at 12:43:29PM +0200, Stefan Sperling wrote:
>> > I'll try to tweak my proposal such that successor ID updates become
>> > transactional and happen as part of every commit.
>>
>> Here's a first shot at this. Comments welcome.
>
> FSFS gurus:
>
> Are any of you looking at this?
> Do you think this is worth writing a prototype implementation for?
>
> I have so far only received feedback from danielsh. This makes me very
> happy but if anyone with a couple more years of FSFS experience under
> their belt could comment I would be even happier.
>
I'm not an FSFS guru, but I still feel that FSFS successor IDs aren't
worth implementing because there are no strong reasons/usages for
them. To me it looks like a bottom-up design approach.

-- 
Ivan Zhakov

question for FSFS gurus (was: Re: FSFS successor ID design draft)

Posted by Stefan Sperling <st...@elego.de>.
On Thu, Sep 01, 2011 at 12:06:40AM +0200, Stefan Sperling wrote:
> On Tue, Aug 30, 2011 at 12:43:29PM +0200, Stefan Sperling wrote:
> > I'll try to tweak my proposal such that successor ID updates become
> > transactional and happen as part of every commit.
> 
> Here's a first shot at this. Comments welcome.

FSFS gurus:

Are any of you looking at this?
Do you think this is worth writing a prototype implementation for?

I have so far only received feedback from danielsh. This makes me very
happy but if anyone with a couple more years of FSFS experience under
their belt could comment I would be even happier.

Thanks.

Re: FSFS successor ID design draft

Posted by Stefan Sperling <st...@elego.de>.
On Tue, Aug 30, 2011 at 12:43:29PM +0200, Stefan Sperling wrote:
> I'll try to tweak my proposal such that successor ID updates become
> transactional and happen as part of every commit.

Here's a first shot at this. Comments welcome.

To support transactional semantics, any new data written to the FSFS
filesystem must be tied to the newly created revision.
This way, updating 'current' as the final step in each commit makes the
new data appear. Failing to update 'current' will cause any data added
by failing commits to be overwritten by the next successful commit.

The following design tries to keep this principle valid for successor-IDs.
There is one compromise, though, which hopefully won't hurt too much.

Thanks again to danielsh for providing some initial feedback via IRC.


Implementation proposal
=======================


- On-disk File Format -

All files related to the map are stored in the repos/db/successors
directory ("successors directory").

The successor map contains 3 kinds of files:

 1) Node-revision map files.
    These files map node-revision IDs to lists of revisions.
    This map answers the question:
    "In which revisions were successors of a node-revision created?"

 2) Created-revision map files.
    These files map revision numbers to file offsets in successor data files.
    This map answers the question:
    "Where in the successor data are those successors which were created
     in revision N?"

 3) Successor data files.
    These files store raw successor data.

The different kinds of files refer to each other as shown below:

                 node_rev_id map     revision map      successor data
                   +-------+          +--------+       +-----------+
                   |  ...  |          |  ...   |       | ...       |
              +------>rN---------+    | offset |  +----->successor | 
              |    |  ...  |     |    | offset |  |    | successor |
              |    |  ...  |     +----->offset----+    | END       |
node_rev_id --+    |  rQ   |          |  ...   |       | successor |
              |    |  ...  |          | offset |       | ...       |
              +------>rX---------+    |  ...   |       | ...       |
                   |  ...  |     |    |  ...   |       | END       |
                   |  ...  |     +----->offset---------->successor |
                   |  rZ   |          |  ...   |       | END       |
                   +-------+          +--------+       +-----------+

Each type of file is described below in detail.


-- Node-revision-ID map files --

The purpose of node-revision-id map files is to facilitate lookup
of successors of a given node-revision.

The files are named after the number of the revision the node-revision
was created in, modulo some fixed to-be-determined value (i.e. there
won't be one file per node revision, but one file for every 10, 100,
or so, node-revisions).

The files are stored within the "node-revs" sub-directory:

 db/successors/node-revs/1
 db/successors/node-revs/2
 db/successors/node-revs/3
 ...
 db/successors/node-revs/N
 db/successors/node-revs/M
 ...

Each node-revision map file stores a mapping of the form
  node_rev_id => [revnum, revnum, ...]
The revision numbers identify those revisions in which one or more
successors of a given node-revision were added.

In the first iteration of this design, the mapping is stored as lines
of ASCII text. Each line contains an unparsed node_revision_id, a space
separator (ASCII 0x20), and a revision number in ASCII decimal representation.
(The design may be refined later to use a more efficient storage model.)
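
As an illustration of this first-iteration format, a lookup in such a map
file could be sketched as follows. (Python is used purely for illustration;
the real implementation would live in libsvn_fs_fs, and the node-revision-ID
strings in the example are made up.)

```python
def read_node_rev_map(f, node_rev_id):
    # f is an open text-mode node-revision map file.  Each line holds an
    # unparsed node-revision ID, a single space (0x20), and a decimal
    # revision number.  Lines for other node-revision IDs that share the
    # same map file (due to the modulo naming scheme) are skipped.
    revs = []
    for line in f:
        key, sep, revnum = line.rstrip('\n').partition(' ')
        if sep and key == node_rev_id:
            revs.append(int(revnum))
    return revs
```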


-- Revision map files --

The purpose of the revision map file is to facilitate lookup
of successor data created in a given revision.

The files are named after the numbers of revisions they are responsible for,
modulo some fixed to-be-determined value (i.e. there won't be one file
per revision, but one file for every 10, 100, or so, revisions; each file
is responsible for some range of revisions).

The files are stored within the "revs" sub-directory:

 db/successors/revs/1
 db/successors/revs/2
 db/successors/revs/3
 ...
 db/successors/revs/N
 db/successors/revs/M
 ...


Each file consists of an array of 64-bit big-endian integers.
The N'th integer (starting from N=1) in the file specifies the offset
of successor data (in the corresponding successor data file) which was
added in the N'th revision within the revision range the map file is
responsible for.
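
A lookup in this array could be sketched like so (illustrative Python; how
the range start is derived from the modulo naming scheme is left open in
the draft, so range_start is an assumption here):

```python
import struct

ENTRY_SIZE = 8  # one 64-bit big-endian integer per revision

def read_successor_data_offset(f, revnum, range_start):
    # f is an open binary revision map file responsible for the revision
    # range starting at range_start.  The N'th entry (N=1 being the first)
    # holds the successor data offset for revision (range_start + N - 1).
    n = revnum - range_start + 1
    f.seek((n - 1) * ENTRY_SIZE)
    (offset,) = struct.unpack('>Q', f.read(ENTRY_SIZE))
    return offset
```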


-- Successor-data files --

These files use the same naming scheme as the revision map files (i.e.
each successor data file is responsible for successors created within
a specific range of revisions).

The files are stored within the "successor-ids" sub-directory:

 db/successors/successor-ids/1
 db/successors/successor-ids/2
 db/successors/successor-ids/3
 ...
 db/successors/successor-ids/N
 db/successors/successor-ids/M
 ...


Each file consists of lines each containing two unparsed node_rev_id
ASCII strings, separated by ASCII 0x20. The first node-revision ID is a
predecessor, and the second one is one of its successors.
The special string "END" on a single line signifies that the following
successor data belongs to the next revision.
(The design may be refined later to use a more efficient storage model.)
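
Reading one revision's block of successor data could then be sketched as
(illustrative Python; the node-revision-ID strings in the test are made up):

```python
def read_successors_for_rev(f, offset, node_rev_id):
    # f is an open text-mode successor data file.  Starting at the given
    # offset, entries are lines of '<predecessor> <successor>' terminated
    # by a line reading 'END'.  Only successors whose predecessor matches
    # node_rev_id are collected; other pairs in the same block are ignored.
    f.seek(offset)
    successors = []
    for line in f:
        line = line.rstrip('\n')
        if line == 'END':
            break
        pred, sep, succ = line.partition(' ')
        if pred == node_rev_id:
            successors.append(succ)
    return successors
```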


- Procedure for Readers -

A reader first reads the 'current' file and remembers the revision number.

It locates the node-revision map file corresponding to the given node_rev_id,
based on the node_rev_id's revision modulo a known fixed value.

It opens this file and reads it to obtain a list of revisions which created
successors of the given node_rev_id. It ignores lines listing node_rev_ids
not matching the given node_rev_id. It also ignores revisions greater than
'current'.

Next, the reader opens the corresponding revision map files,
based on revision numbers in the list, each modulo a known fixed value.
For each revision N in the revision list, it seeks to the revision's byte
offset within a revision map file and reads 8 bytes (one 64-bit entry) to
obtain the offset corresponding to revision N within a successor data file.

The reader then opens the corresponding successor data files,
based on revision numbers in the list, each modulo a known fixed value.
For each revision offset, it seeks to this offset and reads lines until it
finds the line "END". It loops through all lines and appends those
successor IDs to the set of successors which list a predecessor matching
the given node_rev_id.

In exceptional cases (see below) it is possible that a given revision
does not actually contain any successors of the given node_rev_id.


- Procedure for Writers -

Assume that the writer has access to a mapping from predecessor
node-revision ID to one or more successor node-revision IDs.
(The exact mechanism managing this mapping still needs to be determined.
But it can be assumed to be available as all predecessors and successors
are known within the context of a commit.)

When committing a finished transaction (in fs_fs.c:commit_body()),
the writer obtains a write-lock on all files within the successors
directory it needs to modify.

Let rN be the new revision the writer is committing.

The writer creates an empty temporary file.

If rN falls within the revision range of an existing successor data
file, the writer looks up the offset of revision N-1 in the corresponding
revision map file. It copies content from the corresponding successor data
file to a temporary file up to this offset, and copies any data that follows
until a line containing "END" has been copied. (Note that this step discards
data left behind by any previous attempt at committing rN).  

The writer appends any new successor data to the temporary file (note
that there may be no successor data in case the revision is empty).
It then adds a terminating "END".
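
This copy-then-append step could be sketched as follows (illustrative
Python on open text-mode file objects; the draft leaves the actual file
handling details open):

```python
def write_successor_data(old_f, tmp_f, prev_rev_offset, new_pairs):
    # Copy all successor data up to the start of revision rN-1's block...
    tmp_f.write(old_f.read(prev_rev_offset))
    # ...then copy rN-1's own block, up to and including its END line.
    # Anything after that END line is stale data from a previous failed
    # attempt at committing rN and is deliberately not copied.
    for line in old_f:
        tmp_f.write(line)
        if line.rstrip('\n') == 'END':
            break
    # Append rN's new predecessor/successor pairs and terminate the block.
    for pred, succ in new_pairs:
        tmp_f.write('%s %s\n' % (pred, succ))
    tmp_f.write('END\n')
```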

The writer creates another empty temporary file.

If rN falls within the revision range of an existing revision map file,
it copies the entries for all in-range revisions preceding rN (8 bytes
each, matching the 64-bit format above) from the revision map file into the
temporary file. (Note that this step discards data left behind by any
previous attempt at committing rN).

The writer appends, to the temporary file, the offset of the new data
it wrote into the successor data file.

Likewise, the writer creates an empty temporary file for each node-revision
map file it needs to modify. It copies all content from any such
node-revision map files which already exist, and appends a line to each
file containing the node-revision ID and the new revision number.

After moving the proto-revprop file into place, and before updating
'current', the writer moves temporary files into place in the successors
directory, in the following order:

  1) the new successor data file
  2) the new revision map file
  3) each new node_rev_id map file

If the writer fails to update all files in the successors directory,
it will also fail to update 'current'. In this case, readers will ignore
the new successor data until another commit succeeds in creating rN.
Once a commit of rN has succeeded, readers following node-rev-id map
entries updated by the failing commit might end up with an empty
successors list for rN. Such erroneous entries will not be cleaned
up automatically, but can be eliminated by re-creating the repository
(e.g. via a dump/load cycle). However, the actual successor data stored
for committed revisions is always correct, and the likelihood of incorrect
node-revision ID map entries occurring is small.
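
The revision map update step of this writer procedure could be sketched as
(illustrative Python; assumes the 8-byte big-endian entries described
earlier and a hypothetical range_start for the map file's revision range):

```python
import struct

ENTRY_SIZE = 8  # one 64-bit big-endian integer per revision

def write_revision_map(old_f, tmp_f, revnum, range_start, new_offset):
    # Carry over the entries for all revisions in this map file's range
    # that precede revnum.  Any entry a previous failed attempt at
    # committing revnum left behind falls outside this prefix and is
    # thereby discarded.
    keep = (revnum - range_start) * ENTRY_SIZE
    tmp_f.write(old_f.read(keep))
    # Append revnum's entry: the offset of its new successor data.
    tmp_f.write(struct.pack('>Q', new_offset))
```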

Re: FSFS successor ID design draft

Posted by Stefan Sperling <st...@elego.de>.
On Mon, Aug 29, 2011 at 09:38:23PM -0400, C. Michael Pilato wrote:
> You're using the term "cache" alot, which is definitely not the approach I
> took with BDB.  In the BDB code, a missing successor map entry would signify
> a corruption of the filesystem.
 
> Also, the BDB code does all this stuff at the transaction level, not the
> revision level.  So, the node revision IDs of uncommitted nodes that live
> under txn roots can be the "value" of a successor-ID mapping.  You don't
> have to wait until the transaction is committed before that mapping is
> present or valid.  I'm not sure if that's useful or desireable, nor can I
> really tell if it differs from what you are proposing, but as neither the
> word "transaction" nor "txn" show up in your mail, it seemed I should call
> that out.

That's a very good point, thanks Mike!

The semantics of a cache make more sense at the repos-layer,
as described in the log cache design by Stefan^2.

If successor IDs are managed in the FS layer, successors should be
added in a transactional fashion, just like all other changes made
to the FS. The design I proposed doesn't take advantage of this at all.
It is therefore somewhat pointless. A cache might as well be layered
above the FS.

I'll try to tweak my proposal such that successor ID updates become
transactional and happen as part of every commit.

Note that a cache layered above the FS might still be a better idea
because it wouldn't require changes to existing backends and, in
general, solves the problems we're trying to address just as well.
But there is still enough time to explore all our options.

Re: FSFS successor ID design draft

Posted by "C. Michael Pilato" <cm...@collab.net>.
On 08/26/2011 07:14 AM, Stefan Sperling wrote:
> cmpilato, does this significantly differ from what you've done for BDB?
> Will both backends have the same (or similar) behaviour if we use
> this design for FSFS?

Most of what you discuss in this proposal is related to the detailed
properties of the storage layer for successor-id mapping, all of which is
completely irrelevant to the BDB scenario where a real database engine with
well-defined semantics for this sort of thing already exists.

You're using the term "cache" a lot, which is definitely not the approach I
took with BDB.  In the BDB code, a missing successor map entry would signify
a corruption of the filesystem.  I can't recall for sure, but I don't
believe I allowed for a way to distinguish between "This node has no
successors" and "This node's successors haven't been calculated yet".  (And
that was likely because I didn't want to also have to deal with "This node's
successors have been only partially calculated.")  Of course, it's not that
the data isn't calculable from the predecessor links we already store, but
simply that the code assumes that the table has been properly maintained
since the origin of the repository, or at least backfilled in its entirety
while the repository was offline.  Not "cache"-like at all.

Also, the BDB code does all this stuff at the transaction level, not the
revision level.  So, the node revision IDs of uncommitted nodes that live
under txn roots can be the "value" of a successor-ID mapping.  You don't
have to wait until the transaction is committed before that mapping is
present or valid.  I'm not sure if that's useful or desirable, nor can I
really tell if it differs from what you are proposing, but as neither the
word "transaction" nor "txn" show up in your mail, it seemed I should call
that out.

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand


Re: FSFS successor ID design draft

Posted by Stefan Sperling <st...@elego.de>.
On Mon, Aug 29, 2011 at 08:04:09PM +0400, Danil Shopyrin wrote:
> Hi Stefan!
> 
> On Mon, Aug 29, 2011 at 6:54 PM, Stefan Sperling <st...@elego.de> wrote:
> > Any ideas about how to solve this without using successors?
> >
> > [[[
> >  deleted_nodes = nodes which appear in LEFT_PATH@LEFT_REV but not in
> >                  RIGHT_PATH@RIGHT_REV
> >  added_nodes = nodes which appear in RIGHT_PATH@RIGHT_REV but not in
> >                LEFT_PATH@LEFT_REV
> >  moved_nodes = []
> >  # A 'node' in this context is a repos_relpath
> >  for each node in deleted_nodes {
> >    node_rev_id = get_node_node_rev_id(node@LEFT_REV)
> 
> Just a question, why wouldn't you like to iterate over added_nodes
> (instead of deleted_nodes)? It seems that the whole algorithm can be
> reverted and work without successors. Or do I misunderstand something?

Sorry, I'm a real scatterbrain today and didn't recall the real reason
we need copy-to.

You're right, this algorithm can work quite well using existing copyfrom
information. Note that it calculates moves which happened in the delta
requested by the client, e.g. the incoming change during a merge.

The use case we need copy-to for is a different one. It is about nodes
which have moved in the merge target's history.

For instance, when the server tells the client to edit some file in the
merge target branch, the client may find the file missing from its working
copy. The client must now figure out why this file is missing.
It could ask the server "I don't know about this file, can you tell me
if it has been deleted or if it was moved in my own history?"

To answer this question, the server must first track history of the file
in the merge source branch backwards, until it hits the branch point
(the common ancestor of this file in both branches). From the branch point,
it has to trace history on the merge target *forward* to see if it can
find a file in the merge target branch at HEAD which corresponds to the
file the merge must edit.

This forward tracing is what the successor-id and server-side log caching
proposals perform when they populate their caches.
It is still possible to use copyfrom for this lookup. However, without a
cache it is not likely to scale well because every file which appears
in the branch's HEAD revision is now in the added_nodes list.

Re: FSFS successor ID design draft

Posted by Danil Shopyrin <da...@visualsvn.com>.
Hi Stefan!

On Mon, Aug 29, 2011 at 6:54 PM, Stefan Sperling <st...@elego.de> wrote:
> Any ideas about how to solve this without using successors?
>
> [[[
>  deleted_nodes = nodes which appear in LEFT_PATH@LEFT_REV but not in
>                  RIGHT_PATH@RIGHT_REV
>  added_nodes = nodes which appear in RIGHT_PATH@RIGHT_REV but not in
>                LEFT_PATH@LEFT_REV
>  moved_nodes = []
>  # A 'node' in this context is a repos_relpath
>  for each node in deleted_nodes {
>    node_rev_id = get_node_node_rev_id(node@LEFT_REV)

Just a question, why wouldn't you like to iterate over added_nodes
(instead of deleted_nodes)? It seems that the whole algorithm can be
reverted and work without successors. Or do I misunderstand something?

Thanks!

-- 
With best regards,
Danil Shopyrin
VisualSVN Team

Re: FSFS successor ID design draft

Posted by Stefan Sperling <st...@elego.de>.
On Fri, Aug 26, 2011 at 08:59:21AM -0400, C. Michael Pilato wrote:
> On 08/26/2011 08:51 AM, Ivan Zhakov wrote:
> > Mike, thanks for the answer, but I still don't understand how
> > successor ID information could help us with move handling?
> 
> (Honestly, I don't either.  I was taking Stefan's word for it.)

I don't have a proper design for this, just some vague ideas.
I believe that it will work.

The following (untested :) pseudo code shows what the server could
do to find out which nodes to report as moved from A to B within
the delta between LEFT_PATH@LEFT_REV and RIGHT_PATH@RIGHT_REV.

This doesn't strictly require a successor ID cache.
The IDs could also be computed at run-time if this is feasible.

It should be possible to produce the same result with just copyfrom
information and looping over added nodes instead of deleted nodes.
However it seems solving it that way is a lot less straightforward
(well, at least I cannot seem to wrap my brain around it...)
A successor ID cache has more uses besides this one, so it's worth
adding anyway even if we find another way of generating the moved_nodes
list below.

The following converts somewhat freely between paths and node-revision IDs.
In reality there would be more abstraction layers.

[[[
  deleted_nodes = nodes which appear in LEFT_PATH@LEFT_REV but not in
                  RIGHT_PATH@RIGHT_REV
  added_nodes = nodes which appear in RIGHT_PATH@RIGHT_REV but not in
                LEFT_PATH@LEFT_REV
  moved_nodes = []
  # A 'node' in this context is a repos_relpath
  for each node in deleted_nodes {
    node_rev_id = get_node_node_rev_id(node@LEFT_REV)
    successors = get_successors_in_rev_range(node_rev_id,
                                             [LEFT_REV, RIGHT_REV])
    lines_of_history = 0
    copied_successor = NULL
  
    # Look for copies among successors. Ignore copies to other branches.
    # Assume successor IDs are sorted in order, by revision.
    previous_s = node_rev_id
    for s in successors {
      if repos_relpath(s) is [a child of] LEFT_PATH or RIGHT_PATH {
        if get_copy_id(previous_s) != get_copy_id(s) {
          if copied_successor == NULL {
            copied_successor = s
            lines_of_history++
          } else if s is successor of copied_successor {
            copied_successor = s
          } else {
            lines_of_history++
            break
          }
        }
      }
      previous_s = s
    }
  
    # If a single copied successor is among added nodes, the node was moved
    if copied_successor and lines_of_history == 1:
      if added_nodes[repos_relpath(copied_successor)] != NULL:
        moved_from_repos_relpath = node
        moved_to_repos_relpath = repos_relpath(copied_successor)
        moved_nodes += (moved_from_repos_relpath, moved_to_repos_relpath)
  }
]]]

The moved_nodes list gives us the server-side op-roots of moves.
The editor driver still needs to use this smartly to make it useful.
I think it should send "move" commands upfront, e.g. right after open_root().
The client needs to move nodes in its working copy before other changes are
applied. The driver should also adjust paths of moved-along children properly.

Is there anything wrong with this?

Any ideas about how to solve this without using successors?

Re: FSFS successor ID design draft

Posted by "C. Michael Pilato" <cm...@collab.net>.
On 08/26/2011 08:51 AM, Ivan Zhakov wrote:
> Mike, thanks for the answer, but I still don't understand how
> successor ID information could help us with move handling?

(Honestly, I don't either.  I was taking Stefan's word for it.)

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand


Re: FSFS successor ID design draft

Posted by Ivan Zhakov <iv...@visualsvn.com>.
On Fri, Aug 26, 2011 at 16:49, C. Michael Pilato <cm...@collab.net> wrote:
> On 08/26/2011 08:32 AM, Ivan Zhakov wrote:
>> On Fri, Aug 26, 2011 at 15:14, Stefan Sperling <st...@elego.de> wrote:
>>> Below is an initial draft of a design for successor-IDs in FSFS.
>>> It is a bit long, but I hope it covers all relevant points in
>>> sufficient detail.
>>>
>>> Please share your thoughts on this. I have most likely missed a few things.
>>>
>>> If we like it I will put it into the repository for further editing,
>>> either in notes/ on trunk or onto the fsfs-successor ID branch.
>>>
>>> If this design has some severe problem I don't mind going back to
>>> the drawing board. This is my first shot, afterall :)
>>>
>> Hi Stefan,
>>
>> I've may be stupid question: what is the purpose of storing
>> fsfs-successor ID? Do we need them for proper handling of
>> moves/renames or for something else?
>
> We need them for forward-direction history searching, move handing, and
> answering the top-ten enterprise question "I fixed this bug HERE, but in
> what other branches does the same bug exist?"
>
Mike, thanks for the answer, but I still don't understand how
successor ID information could help us with move handling?


-- 
Ivan Zhakov

Re: FSFS successor ID design draft

Posted by "C. Michael Pilato" <cm...@collab.net>.
On 08/26/2011 08:32 AM, Ivan Zhakov wrote:
> On Fri, Aug 26, 2011 at 15:14, Stefan Sperling <st...@elego.de> wrote:
>> Below is an initial draft of a design for successor-IDs in FSFS.
>> It is a bit long, but I hope it covers all relevant points in
>> sufficient detail.
>>
>> Please share your thoughts on this. I have most likely missed a few things.
>>
>> If we like it I will put it into the repository for further editing,
>> either in notes/ on trunk or onto the fsfs-successor ID branch.
>>
>> If this design has some severe problem I don't mind going back to
>> the drawing board. This is my first shot, afterall :)
>>
> Hi Stefan,
> 
> I may have a stupid question: what is the purpose of storing
> fsfs-successor ID? Do we need them for proper handling of
> moves/renames or for something else?

We need them for forward-direction history searching, move handling, and
answering the top-ten enterprise question "I fixed this bug HERE, but in
what other branches does the same bug exist?"

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand


Re: FSFS successor ID design draft

Posted by Ivan Zhakov <iv...@visualsvn.com>.
On Fri, Aug 26, 2011 at 15:14, Stefan Sperling <st...@elego.de> wrote:
> Below is an initial draft of a design for successor-IDs in FSFS.
> It is a bit long, but I hope it covers all relevant points in
> sufficient detail.
>
> Please share your thoughts on this. I have most likely missed a few things.
>
> If we like it I will put it into the repository for further editing,
> either in notes/ on trunk or onto the fsfs-successor ID branch.
>
> If this design has some severe problem I don't mind going back to
> the drawing board. This is my first shot, afterall :)
>
Hi Stefan,

I may have a stupid question: what is the purpose of storing
fsfs-successor ID? Do we need them for proper handling of
moves/renames or for something else?

-- 
Ivan Zhakov

Re: server-side log cache

Posted by Stefan Fuhrmann <eq...@web.de>.
On 30.09.2011 18:19, Stefan Sperling wrote:
> On Thu, Sep 22, 2011 at 08:43:14PM +0200, Stefan Fuhrmann wrote:
>>>>> This looks very interesting.
>>>>>
>>>>> What about FSFS-specific requirements?
>>>> See assumptions above, this may require a different
>>>> data structure. But I think that noderev dependencies
>>>> will turn out to be redundant if you have a log cache
>>>> and access to the skip-delta forwards dependencies.
>>>>> It sounds like you avoid those by storing data in semantics of the repos
>>>>> layer (path@revision) instead of the FS layer (node-revision-id)?
>>>> Yes.
>>>>> In this case separate implementations for FSFS and BDB aren't needed.
>>>>> This could be an advantage (e.g. third party FS implementations
>>>>> wouldn't need to change to support this).
>>>> It also eliminates one of the performance weaknesses
>>>> of SVN today: A log on some old / seldom changed
>>>> path can take a very long time.
>>>>> I'll think about this some more, thanks.
>>>>>
>>>> Welcome ;)
>
> Reviving this thread.
>
> Your concerns about a node-rev based approach seem to revolve largely
> around performance, not about correctness.
Effectiveness, to be precise.
> I.e. you agree that a
> node-rev-based solution as currently being worked on within the
> fs-successor-ids branch will work correctly, but won't perform
> as well as your proposal for certain queries, right?
Since it does not add any information, the node-rev-based
approach will not *cause* incorrect behavior. In that sense,
I agree.

But I still fail to see how it will be effective except for a
very, very specific use-case. I probably just haven't understood
your use-case. Could you give a short description of the problem
that you are trying to solve and how the node-rev cache will help?
>
> Now, I don't feel comfortable trying to implement your design.
> The reason for this is that you could do a much better job at this yourself.
That's fine with me. I won't have the time to do that this year,
though.
>
> However, I do feel very comfortable continuing the work we've started on
> the fs-successor-ids branch.
Having that implementation available will certainly do no harm.
>
> I also think that the two approaches can complement each other.
> We are not in an either/or situation. We will get correct answers either
> way and the only real difference is performance.
>
> Note also that with our plan of putting successor-IDs into the filesystem
> layer, we will also solve the problem of creating the successor data
> during an upgrade from SVN 1.7 to 1.8.
> Both approaches need to solve this somehow, and we'd have that part
> sorted out for you.
I don't see a problem here. If necessary, we could extend the FS
layer API with version check methods etc.
>
> So, what about this: We implement successor-IDs in the filesystem
> as planned on the fs-successor-ids branch.
> Once we have that, and when you have time, you adapt your log cache
> proposal to create a runtime cache that sits on top of the new
> successor-ID filesystem data, and caches results for certain log queries
> in memory for quick access. It should even be possible to pre-populate
> this cache when the server starts up.
How would I reconstruct copy target path names from node-rev info?
>
> This way, we have some amount of redundancy in the system, but Daniel
> and I can continue trying to deliver a working solution for 1.8 based
> on what we've started. And we can worry about performance issues later,
> because you already have a plan for that.
>
> Frankly, I think the time people spend today resolving trivial
> tree-conflicts is a huge waste. No matter how bad the
> performance of an automated solution to this problem will be, it will
> be faster than a human being. Our users will get a huge benefit either
> way because we will be reducing their load of manual labour. Performance
> of the solution doesn't need to be perfect by the time we release 1.8
> and nothing stands in the way of improving performance later.
>
> Do you agree?
This depends entirely on your use-case (see above). My experience
with navigating these change graphs indicates that better-than
O(n^2) performance requires completely different algorithms, data
structures and APIs than a merely correct path-by-path approach.
BTW, n > 10,000,000 for certain repositories.
>
> If not, I hope that you'll find time to help us implement your pure
> caching solution for 1.8. I would really like to see some solution
> to this problem in the 1.8. release.
I'm currently working on other SVN-related projects.
From April on, I'm available for hire.

-- Stefan^2.


Re: server-side log cache

Posted by Stefan Sperling <st...@elego.de>.
On Thu, Sep 22, 2011 at 08:43:14PM +0200, Stefan Fuhrmann wrote:
> >>>This looks very interesting.
> >>>
> >>>What about FSFS-specific requirements?
> >>See assumptions above, this may require a different
> >>data structure. But I think that noderev dependencies
> >>will turn out to be redundant if you have a log cache
> >>and access to the skip-delta forwards dependencies.
> >>>It sounds like you avoid those by storing data in semantics of the repos
> >>>layer (path@revision) instead of the FS layer (node-revision-id)?
> >>Yes.
> >>>In this case separate implementations for FSFS and BDB aren't needed.
> >>>This could be an advantage (e.g. third party FS implementations
> >>>wouldn't need to change to support this).
> >>It also eliminates one of the performance weaknesses
> >>of SVN today: A log on some old / seldom changed
> >>path can take a very long time.
> >>>I'll think about this some more, thanks.
> >>>
> >>Welcome ;)

Reviving this thread.

Your concerns about a node-rev based approach seem to revolve largely
around performance, not about correctness. I.e. you agree that a
node-rev-based solution as currently being worked on within the
fs-successor-ids branch will work correctly, but won't perform
as well as your proposal for certain queries, right?

Now, I don't feel comfortable trying to implement your design.
The reason for this is that you could do a much better job at this yourself.
However, I do feel very comfortable continuing the work we've started on
the fs-successor-ids branch.

I also think that the two approaches can complement each other.
We are not in an either/or situation. We will get correct answers either
way and the only real difference is performance.

Note also that with our plan of putting successor-IDs into the filesystem
layer, we will also solve the problem of creating the successor data
during an upgrade from SVN 1.7 to 1.8.
Both approaches need to solve this somehow, and we'd have that part
sorted out for you.

So, what about this: We implement successor-IDs in the filesystem
as planned on the fs-successor-ids branch.
Once we have that, and when you have time, you adapt your log cache
proposal to create a runtime cache that sits on top of the new
successor-ID filesystem data, and caches results for certain log queries
in memory for quick access. It should even be possible to pre-populate
this cache when the server starts up.

This way, we have some amount of redundancy in the system, but Daniel
and I can continue trying to deliver a working solution for 1.8 based
on what we've started. And we can worry about performance issues later,
because you already have a plan for that.

Frankly, I think the time people spend today resolving trivial
tree-conflicts is a huge waste. No matter how bad the
performance of an automated solution to this problem will be, it will
be faster than a human being. Our users will get a huge benefit either
way because we will be reducing their load of manual labour. Performance
of the solution doesn't need to be perfect by the time we release 1.8
and nothing stands in the way of improving performance later.

Do you agree?

If not, I hope that you'll find time to help us implement your pure
caching solution for 1.8. I would really like to see some solution
to this problem in the 1.8. release.

Re: server-side log cache

Posted by Stefan Fuhrmann <eq...@web.de>.
On 20.09.2011 23:40, Daniel Shahaf wrote:
> Re-reading this to see how it affects the fs-successor-ids work (in
> particular, whether it supersedes the done/planned work), one question:
>
>
> Why do you refer to skip-deltas?
I haven't thought too deeply about the skip-delta issue
but that is my reasoning: For all questions concerning
the "change flow", the noderev level seems to be the
wrong place to go as I tried to explain.

If you are trying to answer repo-internal questions like
"can I remove this chunk of data?", the dependencies
between the individual representations (skip delta,
shared nodereps) may be more interesting.
>
>
> Consider the following transformation to a repository:
>
>    for noderev in FS:
>      noderev.props['svn:contents'] = noderev.cat()
>      noderev.contents = ""
>
> This transformation does not change the successor relations, but it does
> change the skip-deltas mapping.  Would your algorithms for
> successors/copyto work correctly on a repository that has undergone the
> above transformation?
My algorithms don't care about noderevs or any layer
below that (including representations). The skip-delta
argument was only to illustrate that the noderevs cache
might be even less helpful.

-- Stefan^2.
>
> Stefan Fuhrmann wrote on Mon, Sep 19, 2011 at 10:42:55 +0200:
>> On 29.08.2011 18:35, Stefan Sperling wrote:
>>
>> Sorry for the late response. I have been knocked
>> out for a while.
>>> On Sun, Aug 28, 2011 at 03:46:03PM +0200, Stefan Fuhrmann wrote:
>>>>> See http://svn.apache.org/repos/asf/subversion/branches/fs-successor-ids/BRANCH-README
>>>>> for what this is all about.
>>>> But the assumptions in that file are actually not valid.
>>> Which ones are invalid? Can you explain in detail?
>> It makes at least 5 implicit assumptions:
>>
>> * Noderevs are the relevant level on which user-level
>>    questions can be answered.
>> ->  That might not be the case. Noderevs are a purely
>>     internal construct. In particular, a noderev might get
>>     copied but the actual delta is empty. Thinking about
>>     svn obliterate&  friends that might actually happen
>>     as a result of some node removal / replacement.
>>
>> * User-level queries (where did the node get copied to?) and
>>    internal queries (what are the copy dependencies?)
>>    are looking for the same information.
>> ->  This is linked to the previous. Noderev copy-to info
>>     can be useful to determine the dependency graph.
>>     The point here is that we are mixing queries on two
>>     different layers here. If we need both, they should
>>     both exist as separate APIs.
>>
>> * Noderev changes are relevant for all svn obliterate scenarios.
>> ->  That may not always be true. If you only want to
>>     remove "unused" history, the references in the skip-
>>     delta tree express the relevant dependencies.
>>
>> * Noderev changes are sufficient to answer the
>>    "where has this been copied to?" question.
>> ->  They are quite unhelpful here. Noderevs might
>>     help you find changes on the node itself (see first
>>     assumption). But you need to evaluate the user-
>>     level copy operations of all parent folders for all
>>     currently known copies. There are far less of these
>>     copies (e.g. branch creation) than noderev changes
>>     (typ. multiple per revision).
>>
>> * A noderev-based cache alone would reduce the
>>    amount of data to inspect / process for these queries.
>> ->  The noderev cache is *larger* than e.g. a log cache
>>     due to each change also modifying all parent nodes.
>>     To be efficient, you need a node-level copy-to cache
>>     as well (see previous item).
>>>> * "Where did path P at rev N to M get copied to, directly
>>>>   or indirectly, e.g. which releases contain a certain faulty
>>>>   code segment; optionally list changes to targets?"
>>>> ->   needs to scan parent paths for copies, too
>>>>    (reversed "log", revision graph)
>>> Yes, the successor-id cache only gives us operation roots.
>>> Information for child nodes needs to be derived -- it is not
>>> within the scope of the cache itself.
>> That is unnecessarily complicated / expensive.
>> Using the proposed log cache, the relevant
>> information can be found in a *single*, highly
>> efficient scan.
>>>> It turns out that we can produce even humongous
>>>> reverse logs (50+ k nodes, thousands of branches
>>>> and tags) within a second by simply performing
>>>> a full history scan.
>>>>
>>>> An example of how the whole process can be
>>>> implemented efficiently can be found here:
>>>>
>>>> https://tortoisesvn.googlecode.com/svn/trunk/src/TortoiseProc/RevisionGraph/FullGraphBuilder.cpp
>>> I'll take a look at that, thanks!
>> The gist of it is that you keep all currently known
>> copy targets (e.g. branches, tags) in a tree. For
>> each revision, you walk that tree to find the affected
>> sub-tree. The LRU path gets optimized to speed up
>> the next look-up.
>>
>> A usually fixed repo layout plus typical change
>> patterns result in a quasi-constant processing
>> time for each revision - independent from the
>> number of copies already found.
>>
>> Things become much simpler and faster if you
>> eliminate string operations and use path IDs
>> instead. TSVN does an "svn log" at over 10M
>> revs per second.
>>>>> Storage of successor IDs is essentially a cache of the result of the
>>>>> inversion of the predecessor relation between all node-revisions.
>>>>> This inversion is a *very* expensive operation (follow every node
>>>>> that ever existed in the repository from its most recent revision
>>>>> back to the revision it was created in).
>>>> Not true. At least the "*very* expensive" part.
>>>> Log -v takes about 1min for AFS over ra_local.
>>>> Building any of the index data described below
>>> can be done in < 10s.
>>> Any of it (if so, which parts?), or all of it?
>> All of it, of course. The expensive part is "svn log -v"
>> which may be expensive on slow disks. Everything
>> else is just writing a few MB of data.
>>>> I propose a modified version of TSVN's log cache
>>>> data structure. The adaptations:
>>>>
>>>> * split into multiple files to reduce modification overhead
>>>> * remove rev-prop-dependent information (log comment,
>>>>    author, timestamp)
>>>> * add the reverse copy info
>>>> * simplify the data structures
>>> This looks very interesting.
>>>
>>> What about FSFS-specific requirements?
>> See assumptions above, this may require a different
>> data structure. But I think that noderev dependencies
>> will turn out to be redundant if you have a log cache
>> and access to the skip-delta forwards dependencies.
>>> It sounds like you avoid those by storing data in semantics of the repos
>>> layer (path@revision) instead of the FS layer (node-revision-id)?
>> Yes.
>>> In this case separate implementations for FSFS and BDB aren't needed.
>>> This could be an advantage (e.g. third party FS implementations
>>> wouldn't need to change to support this).
>> It also eliminates one of the performance weaknesses
>> of SVN today: A log on some old / seldom changed
>> path can take a very long time.
>>> I'll think about this some more, thanks.
>>>
>> Welcome ;)
>>
>> -- Stefan^2.


Re: server-side log cache

Posted by Daniel Shahaf <da...@elego.de>.
Re-reading this to see how it affects the fs-successor-ids work (in
particular, whether it supersedes the done/planned work), one question:


Why do you refer to skip-deltas?


Consider the following transformation to a repository:

  for noderev in FS:
    noderev.props['svn:contents'] = noderev.cat()
    noderev.contents = ""

This transformation does not change the successor relations, but it does
change the skip-deltas mapping.  Would your algorithms for
successors/copyto work correctly on a repository that has undergone the
above transformation?



Stefan Fuhrmann wrote on Mon, Sep 19, 2011 at 10:42:55 +0200:
> On 29.08.2011 18:35, Stefan Sperling wrote:
> 
> Sorry for the late response. I have been knocked
> out for a while.
> >On Sun, Aug 28, 2011 at 03:46:03PM +0200, Stefan Fuhrmann wrote:
> >>>See http://svn.apache.org/repos/asf/subversion/branches/fs-successor-ids/BRANCH-README
> >>>for what this is all about.
> >>But the assumptions in that file are actually not valid.
> >Which ones are invalid? Can you explain in detail?
> It makes at least 5 implicit assumptions:
> 
> * Noderevs are the relevant level on which user-level
>   questions can be answered.
> -> That might not be the case. Noderevs are a purely
>    internal construct. In particular, a noderev might get
>    copied but the actual delta is empty. Thinking about
>    svn obliterate & friends that might actually happen
>    as a result of some node removal / replacement.
> 
> * User-level queries (where did the node get copied to?) and
>   internal queries (what are the copy dependencies?)
>   are looking for the same information.
> -> This is linked to the previous. Noderev copy-to info
>    can be useful to determine the dependency graph.
>    The point here is that we are mixing queries on two
>    different layers here. If we need both, they should
>    both exist as separate APIs.
> 
> * Noderev changes are relevant for all svn obliterate scenarios.
> -> That may not always be true. If you only want to
>    remove "unused" history, the references in the skip-
>    delta tree express the relevant dependencies.
> 
> * Noderev changes are sufficient to answer the
>   "where has this been copied to?" question.
> -> They are quite unhelpful here. Noderevs might
>    help you find changes on the node itself (see first
>    assumption). But you need to evaluate the user-
>    level copy operations of all parent folders for all
>    currently known copies. There are far less of these
>    copies (e.g. branch creation) than noderev changes
>    (typ. multiple per revision).
> 
> * A noderev-based cache alone would reduce the
>   amount of data to inspect / process for these queries.
> -> The noderev cache is *larger* than e.g. a log cache
>    due to each change also modifying all parent nodes.
>    To be efficient, you need a node-level copy-to cache
>    as well (see previous item).
> >
> >>* "Where did path P at rev N to M get copied to, directly
> >>  or indirectly, e.g. which releases contain a certain faulty
> >>  code segment; optionally list changes to targets?"
> >>->  needs to scan parent paths for copies, too
> >>   (reversed "log", revision graph)
> >Yes, the successor-id cache only gives us operation roots.
> >Information for child nodes needs to be derived -- it is not
> >within the scope of the cache itself.
> That is unnecessarily complicated / expensive.
> Using the proposed log cache, the relevant
> information can be found in a *single*, highly
> efficient scan.
> >
> >>It turns out that we can produce even humongous
> >>reverse logs (50+ k nodes, thousands of branches
> >>and tags) within a second by simply performing
> >>a full history scan.
> >>
> >>An example of how the whole process can be
> >>implemented efficiently can be found here:
> >>
> >>https://tortoisesvn.googlecode.com/svn/trunk/src/TortoiseProc/RevisionGraph/FullGraphBuilder.cpp
> >I'll take a look at that, thanks!
> The gist of it is that you keep all currently known
> copy targets (e.g. branches, tags) in a tree. For
> each revision, you walk that tree to find the affected
> sub-tree. The LRU path gets optimized to speed up
> the next look-up.
> 
> A usually fixed repo layout plus typical change
> patterns result in a quasi-constant processing
> time for each revision - independent from the
> number of copies already found.
> 
> Things become much simpler and faster if you
> eliminate string operations and use path IDs
> instead. TSVN does an "svn log" at over 10M
> revs per second.
> >>>Storage of successor IDs is essentially a cache of the result of the
> >>>inversion of the predecessor relation between all node-revisions.
> >>>This inversion is a *very* expensive operation (follow every node
> >>>that ever existed in the repository from its most recent revision
> >>>back to the revision it was created in).
> >>Not true. At least the "*very* expensive" part.
> >>Log -v takes about 1min for AFS over ra_local.
> >>Building any of the index data described below
> >>can be done in < 10s.
> >Any of it (if so, which parts?), or all of it?
> All of it, of course. The expensive part is "svn log -v"
> which may be expensive on slow disks. Everything
> else is just writing a few MB of data.
> >>I propose a modified version of TSVN's log cache
> >>data structure. The adaptations:
> >>
> >>* split into multiple files to reduce modification overhead
> >>* remove rev-prop-dependent information (log comment,
> >>   author, timestamp)
> >>* add the reverse copy info
> >>* simplify the data structures
> >This looks very interesting.
> >
> >What about FSFS-specific requirements?
> See assumptions above, this may require a different
> data structure. But I think that noderev dependencies
> will turn out to be redundant if you have a log cache
> and access to the skip-delta forwards dependencies.
> >It sounds like you avoid those by storing data in semantics of the repos
> >layer (path@revision) instead of the FS layer (node-revision-id)?
> Yes.
> >In this case separate implementations for FSFS and BDB aren't needed.
> >This could be an advantage (e.g. third party FS implementations
> >wouldn't need to change to support this).
> It also eliminates one of the performance weaknesses
> of SVN today: A log on some old / seldom changed
> path can take a very long time.
> >
> >I'll think about this some more, thanks.
> >
> Welcome ;)
> 
> -- Stefan^2.

Re: server-side log cache

Posted by Stefan Fuhrmann <eq...@web.de>.
On 29.08.2011 18:35, Stefan Sperling wrote:

Sorry for the late response. I have been knocked
out for a while.
> On Sun, Aug 28, 2011 at 03:46:03PM +0200, Stefan Fuhrmann wrote:
>>> See http://svn.apache.org/repos/asf/subversion/branches/fs-successor-ids/BRANCH-README
>>> for what this is all about.
>> But the assumptions in that file are actually not valid.
> Which ones are invalid? Can you explain in detail?
It makes at least 5 implicit assumptions:

* Noderevs are the relevant level on which user-level
   questions can be answered.
-> That might not be the case. Noderevs are a purely
    internal construct. In particular, a noderev might get
    copied but the actual delta is empty. Thinking about
    svn obliterate & friends that might actually happen
    as a result of some node removal / replacement.

* User-level queries (where did the node get copied to?) and
   internal queries (what are the copy dependencies?)
   are looking for the same information.
-> This is linked to the previous. Noderev copy-to info
    can be useful to determine the dependency graph.
    The point here is that we are mixing queries on two
    different layers here. If we need both, they should
    both exist as separate APIs.

* Noderev changes are relevant for all svn obliterate scenarios.
-> That may not always be true. If you only want to
    remove "unused" history, the references in the skip-
    delta tree express the relevant dependencies.

* Noderev changes are sufficient to answer the
   "where has this been copied to?" question.
-> They are quite unhelpful here. Noderevs might
    help you find changes on the node itself (see first
    assumption). But you need to evaluate the user-
    level copy operations of all parent folders for all
    currently known copies. There are far less of these
    copies (e.g. branch creation) than noderev changes
    (typ. multiple per revision).

* A noderev-based cache alone would reduce the
   amount of data to inspect / process for these queries.
-> The noderev cache is *larger* than e.g. a log cache
    due to each change also modifying all parent nodes.
    To be efficient, you need a node-level copy-to cache
    as well (see previous item).
>
>> * "Where did path P at rev N to M get copied to, directly
>>   or indirectly, e.g. which releases contain a certain faulty
>>   code segment; optionally list changes to targets?"
>> ->  needs to scan parent paths for copies, too
>>    (reversed "log", revision graph)
> Yes, the successor-id cache only gives us operation roots.
> Information for child nodes needs to be derived -- it is not
> within the scope of the cache itself.
That is unnecessarily complicated / expensive.
Using the proposed log cache, the relevant
information can be found in a *single*, highly
efficient scan.
>
>> It turns out that we can produce even humongous
>> reverse logs (50+ k nodes, thousands of branches
>> and tags) within a second by simply performing
>> a full history scan.
>>
>> An example of how the whole process can be
>> implemented efficiently can be found here:
>>
>> https://tortoisesvn.googlecode.com/svn/trunk/src/TortoiseProc/RevisionGraph/FullGraphBuilder.cpp
> I'll take a look at that, thanks!
The gist of it is that you keep all currently known
copy targets (e.g. branches, tags) in a tree. For
each revision, you walk that tree to find the affected
sub-tree. The LRU path gets optimized to speed up
the next look-up.

A usually fixed repo layout plus typical change
patterns result in a quasi-constant processing
time for each revision - independent of the
number of copies already found.

Things become much simpler and faster if you
eliminate string operations and use path IDs
instead. TSVN does an "svn log" at over 10M
revs per second.
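
To make the tree-walk idea above concrete, here is a minimal,
hypothetical Python sketch (the names are mine, not actual TSVN
identifiers): known copy targets are kept in a path trie, and a changed
path is matched against its own prefix chain instead of against every
known copy.

```python
# Hypothetical sketch of the copy-target tree described above.
# Known copy targets (e.g. branches, tags) are kept in a path trie;
# for each revision, a changed path only walks its prefix chain.

class CopyTargetTree:
    def __init__(self):
        self.children = {}   # path component -> CopyTargetTree
        self.targets = []    # (copy_target_path, revision) pairs

    def insert(self, source_path, target_path, rev):
        """Record that source_path was copied to target_path in rev."""
        node = self
        for part in source_path.strip('/').split('/'):
            node = node.children.setdefault(part, CopyTargetTree())
        node.targets.append((target_path, rev))

    def matches(self, changed_path):
        """Yield copies recorded on changed_path or any of its parents."""
        node = self
        for part in changed_path.strip('/').split('/'):
            node = node.children.get(part)
            if node is None:
                return
            yield from node.targets

tree = CopyTargetTree()
tree.insert('/trunk', '/branches/b', 5)
# A change under /trunk/p/f is affected by the /trunk -> /branches/b copy:
hits = list(tree.matches('/trunk/p/f'))
```

The walk visits at most as many nodes as the path has components, which
is what keeps the per-revision cost quasi-constant.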
>>> Storage of successor IDs is essentially a cache of the result of the
>>> inversion of the predecessor relation between all node-revisions.
>>> This inversion is a *very* expensive operation (follow every node
>>> that ever existed in the repository from its most recent revision
>>> back to the revision it was created in).
>> Not true. At least the "*very* expensive" part.
>> Log -v takes about 1min for AFS over ra_local.
>> Building any of the index data described below
>> can be done in < 10s.
> Any of it (if so, which parts?), or all of it?
All of it, of course. The expensive part is "svn log -v"
which may be expensive on slow disks. Everything
else is just writing a few MB of data.
>> I propose a modified version of TSVN's log cache
>> data structure. The adaptations:
>>
>> * split into multiple files to reduce modification overhead
>> * remove rev-prop-dependent information (log comment,
>>    author, timestamp)
>> * add the reverse copy info
>> * simplify the data structures
> This looks very interesting.
>
> What about FSFS-specific requirements?
See assumptions above, this may require a different
data structure. But I think that noderev dependencies
will turn out to be redundant if you have a log cache
and access to the skip-delta forwards dependencies.
> It sounds like you avoid those by storing data in semantics of the repos
> layer (path@revision) instead of the FS layer (node-revision-id)?
Yes.
> In this case separate implementations for FSFS and BDB aren't needed.
> This could be an advantage (e.g. third party FS implementations
> wouldn't need to change to support this).
It also eliminates one of the performance weaknesses
of SVN today: A log on some old / seldom changed
path can take a very long time.
>
> I'll think about this some more, thanks.
>
Welcome ;)

-- Stefan^2.

server-side log cache (was: Re: FSFS successor ID design draft)

Posted by Stefan Sperling <st...@elego.de>.
On Sun, Aug 28, 2011 at 03:46:03PM +0200, Stefan Fuhrmann wrote:
> >See http://svn.apache.org/repos/asf/subversion/branches/fs-successor-ids/BRANCH-README
> >for what this is all about.
> But the assumptions in that file are actually not valid.

Which ones are invalid? Can you explain in detail?

> * "Where did path P at rev N to M get copied to, directly
>   or indirectly, e.g. which releases contain a certain faulty
>  code segment; optionally list changes to targets?"
> -> needs to scan parent paths for copies, too
>   (reversed "log", revision graph)

Yes, the successor-id cache only gives us operation roots.
Information for child nodes needs to be derived -- it is not
within the scope of the cache itself.
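
A toy illustration of that derivation (the helper name is hypothetical,
not a proposed API): since the cache records only the copy of the
operation root, a child's location in the copy target is obtained by
rewriting the root prefix.

```python
# Hypothetical sketch: the successor cache records only the operation
# root (e.g. /trunk copied to /tags/t); information for a child such as
# /trunk/p/f is derived by rewriting the root prefix.

def derive_child_target(child, op_root_src, op_root_dst):
    """Map a child path through a copy recorded only at its root."""
    assert child == op_root_src or child.startswith(op_root_src + '/')
    return op_root_dst + child[len(op_root_src):]

target = derive_child_target('/trunk/p/f', '/trunk', '/tags/t')
```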

> It turns out that we can produce even humongous
> reverse logs (50+ k nodes, thousands of branches
> and tags) within a second by simply performing
> a full history scan.
> 
> An example of how the whole process can be
> implemented efficiently can be found here:
> 
> https://tortoisesvn.googlecode.com/svn/trunk/src/TortoiseProc/RevisionGraph/FullGraphBuilder.cpp

I'll take a look at that, thanks!

> >Storage of successor IDs is essentially a cache of the result of the
> >inversion of the predecessor relation between all node-revisions.
> >This inversion is a *very* expensive operation (follow every node
> >that ever existed in the repository from its most recent revision
> >back to the revision it was created in).
> Not true. At least the "*very* expensive" part.
> Log -v takes about 1min for AFS over ra_local.
> Building any of the index data described below
> can be done in < 10s.

Any of it (if so, which parts?), or all of it?

> I propose a modified version of TSVN's log cache
> data structure. The adaptations:
> 
> * split into multiple files to reduce modification overhead
> * remove rev-prop-dependent information (log comment,
>   author, timestamp)
> * add the reverse copy info
> * simplify the data structures

This looks very interesting.

What about FSFS-specific requirements?
It sounds like you avoid those by storing data in semantics of the repos
layer (path@revision) instead of the FS layer (node-revision-id)?
In this case separate implementations for FSFS and BDB aren't needed.
This could be an advantage (e.g. third party FS implementations
wouldn't need to change to support this).

I'll think about this some more, thanks.

Re: FSFS successor ID design draft

Posted by Stefan Fuhrmann <eq...@web.de>.
On 26.08.2011 13:14, Stefan Sperling wrote:
> Below is an initial draft of a design for successor-IDs in FSFS.
> It is a bit long, but I hope it covers all relevant points in
> sufficient detail.
>
> Please share your thoughts on this. I have most likely missed a few things.
Sure. Given that I implemented a bit more than that
in TSVN, I have some things to say about what works
and what doesn't.
>
>
> FSFS successor IDs
> ==================
>
> See http://svn.apache.org/repos/asf/subversion/branches/fs-successor-ids/BRANCH-README
> for what this is all about.
But the assumptions in that file are actually not valid.
There are at least following use-cases for successor IDs:

* "has there been a change for path P since rev R?"
-> a simple svn info + opt. svn ls will answer that

* "has there been a change for the subtree of path P
   since rev R?"
-> a simple svn info will answer that

* "where have changes been made to / after (P,R);
   search in copies, too?"
-> This can actually be answered directly using
   the proposed successor IDs.

* "Where was path P at revs N to M directly or indirectly
  copied to, e.g. which releases contain a certain faulty
  code segment; optionally list changes to the targets?"
-> needs to scan parent paths for copies, too
   (reversed "log", revision graph)

* "Where has a given list of patches been applied
   to (directly or via merge); and where not?"
-> similarly, reversed log -g


Operation to implement
======================

The core "reverse log" operation would return
data similar to what "log" returns. For that, we need to
follow all copies, including parent copies:

rN:   M /trunk/p/f
rN+1: A /tags/t from /trunk@N
rN+2: A /branches/b from /tags/t@N+1
rN+3: M /branches/b/p/f

You want all these revisions to be reported
(possibly filtering the result in a later stage).
One problem here is that we can't use the successor
ID relation for noderevs here because it will report
*every* revision as we need to follow /trunk, /tags
and /branches.

It turns out that we can produce even humongous
reverse logs (50+ k nodes, thousands of branches
and tags) within a second by simply performing
a full history scan.

An example of how the whole process can be
implemented efficiently can be found here:

https://tortoisesvn.googlecode.com/svn/trunk/src/TortoiseProc/RevisionGraph/FullGraphBuilder.cpp

Also, the change info returned by log should be
extended by an enum {parent, this, subtree}. This
would indicate whether that change corresponds
to a parent of the original path (e.g. when creating
a branch), to the original path itself or to some
sub-tree.
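Such a classification could be sketched as follows (a hypothetical
helper in Python for brevity; the real implementation would live in
the C log code, and the function name is mine):

```python
def classify_change(changed_path, query_path):
    """Classify a changed path relative to the path a reverse log was
    requested for: 'parent' if the change happened on a parent of the
    query path (e.g. when a branch was created), 'this' if it is the
    query path itself, 'subtree' if it lies below the query path."""
    changed = changed_path.rstrip('/')
    query = query_path.rstrip('/')
    if changed == query:
        return 'this'
    if query.startswith(changed + '/'):
        return 'parent'
    if changed.startswith(query + '/'):
        return 'subtree'
    return None  # unrelated path


# e.g. the branch creation in rN+2 above is a 'parent' change
# relative to a query for /branches/b/p/f
```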

>
> What follows are ideas about implementing successor ID support for FSFS.
>
> Useful background information:
>    Node-revisions: subversion/libsvn_fs_base/notes/fs-history
>                    subversion/libsvn_fs_base/notes/structure
>    fsfs design: subversion/libsvn_fs_fs/structure
>
>
> The "cache" and its properties
> ==============================
>
> Storage of successor IDs is essentially a cache of the result of the
> inversion of the predecessor relation between all node-revisions.
> This inversion is a *very* expensive operation (follow every node
> that ever existed in the repository from its most recent revision
> back to the revision it was created in).
Not true. At least the "*very* expensive" part.
Log -v takes about 1min for the ASF repo over ra_local.
Building any of the index data described below
can be done in < 10s.
>
>
> In the following, "cache" refers to the store for successor IDs.
>
> After a general introduction, a proposed cache design is presented.
> Note that the cache implementation will be hidden behind the API of
> libsvn_fs_fs and can be improved or changed in every new minor
> release of Subversion.
>
> Required properties of successor ID cache for FSFS:
>   - FSFS-compatible:
>     - cache must use some plain file format
>     - writers do not block readers (i.e. sqlite is out of the question)
>     - cache must not rely on modifying existing revision files
>   - cache can be regenerated on demand
>   - do not corrupt the cache in case of an unexpected crash
>   - try to keep number of directory entries low
>   - conserve inodes by design (preferred), or via packing support
Node IDs may not be the most convenient level to operate
on. They imply a certain redundancy (any change requires
new node IDs for all parent paths).
>
> Nice-to-have properties:
>   - the repository can be live during cache generation
I propose a modified version of TSVN's log cache
data structure. The adaptations:

* split into multiple files to reduce modification overhead
* remove rev-prop-dependent information (log comment,
   author, timestamp)
* add the reverse copy info
* simplify the data structures

>
>
> Storage size requirements
> =========================
>
> Ballpark figures about node-revs from #svn-dev:
>    * cmpilato does some math.  There are over 350,000 node revisions in my
>        ancient copy of the svn.collab.net Subversion repos.
>    <hwright>  there are over 20M of them in the ASF repo
<snip>

As of r1162184, the ASF repo contains
* 9002894 changes on
* 4272329 different paths constructed from
* 791804 path elements (different file and folder names)
(over 20M words of commentary etc.)

Total cache size on disk: 79MB
Estimated size after adaptations: 60 .. 70M
(due to significant duplication)

>  Implementation proposal
>  =======================

<snip>

Data model
==========

revisions -> { revision }           // array, indexed by revnum
revision  -> ( common_base_pathID,  // pathID of the longest common
                                    //   base path of all changes
               common_change_mask,  // indicates what kinds of changes
                                    //   are contained in this revision
               changes_count,
               { change } )
change    -> ( change_kind, pathID [, fromRev, from_pathID] )
pathID    -> ( parent_pathID, elementID )
                                    // pathID 0 is for "/"
                                    // parent_pathID < pathID
elementID -> string                 // 0-terminated UTF-8

fwdCopies -> { fwdCopy }            // ordered by source_rev
fwdCopy   -> ( source_pathID, source_rev, target_pathID, target_rev )
                                    // source_rev is going to be implicit

The pathID <-> path and elementID <-> element string
relations are bijective.
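In Python-style pseudocode (record and function names are my own, not
part of the proposal), the records and the path reconstruction enabled
by the parent_pathID < pathID invariant look roughly like this:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical in-memory mirrors of the records described above.
@dataclass
class Change:
    change_kind: int                    # add/modify/delete/... flags
    path_id: int
    from_rev: Optional[int] = None      # only present for copies
    from_path_id: Optional[int] = None

@dataclass
class Revision:
    common_base_path_id: int   # longest common base path of all changes
    common_change_mask: int    # aggregated change kinds in this revision
    changes: list = field(default_factory=list)  # changes_count = len()

# pathID -> (parent_pathID, elementID); pathID 0 is "/".  Because
# parent_pathID < pathID, reconstructing a full path is a simple
# walk toward the root:
def path_of(path_id, id2path, elem2str):
    parts = []
    while path_id != 0:
        parent_id, elem_id = id2path[path_id]
        parts.append(elem2str[elem_id])
        path_id = parent_id
    return '/' + '/'.join(reversed(parts))
```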


External format
===============

This is not a strict append-only format as individual
files will get replaced / rewritten during commit.
Because it will never remove records describing previous
revisions, reading while writing is safe assuming that
you limit your query to committed revisions.

All data files are sharded to support even huge
repositories and unusual usage patterns. A query will
read the relevant shards, extract the fragments into an
LRU cache and, in most cases, keep them until the query
is finished.

(I) Overall structure

/logcache/ids         next free IDs
/logcache/elem2s/*    elementID -> string; 4k per shard
/logcache/s2elem/*    string -> elementID; 256 fragments
/logcache/id2path/*   pathID -> (parentID, elementID); 16k per shard
/logcache/path2id/*   (parentID, elementID) -> pathID; 256 fragments
/logcache/revision/*  revisions; 4k revs per shard
/logcache/fwdcopy/*   fwdCopies; 4k revs per shard

(II) Path elements

* elementID -> string
   a simple concatenation of zero-terminated UTF8 strings
* string -> elementID
   - shard selected by lower 4 bits of the first two chars
     in string (padding with 0 if necessary)
   - entry counter, followed by a sequence of 0-terminated
     UTF-8 strings sorted lexicographically, followed by the
     list of corresponding elementIDs

(III) Path IDs

* pathID -> (parentID, elementID)
   list of parentIDs as differences to the previous value,
   followed by the list of elementIDs (non-differential)
* (parentID, elementID) -> pathID
   - shard selected by (parentID % 255) ^ (elementID % 255)
   - entry counter, followed by a sequence of differential parentIDs,
     followed by the list of differential elementIDs,
     followed by the list of corresponding pathIDs,
     sorted lexicographically by (parentID, pathID)
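The two shard-selection rules above could be sketched like this (Python
for brevity; function names are mine):

```python
def string_shard(s):
    """Select one of the 256 string -> elementID fragments from the
    lower 4 bits of each of the first two characters of the string
    (0-padded if the string is shorter than two bytes)."""
    b = s.encode('utf-8')
    c0 = b[0] if len(b) > 0 else 0
    c1 = b[1] if len(b) > 1 else 0
    return ((c0 & 0x0F) << 4) | (c1 & 0x0F)

def path_shard(parent_id, element_id):
    """Select the (parentID, elementID) -> pathID fragment."""
    return (parent_id % 255) ^ (element_id % 255)
```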

(IV) Revisions

* each shard contains:
   list of common base pathIDs,
   list of aggregated change flags,
   list of numbers of changes per respective revision,
   list of change kinds,
   list of pathIDs,
   list of from_revs (only where indicated in the change kind),
   list of from_pathIDs (only where indicated in the change kind)

(V) FwdCopies

* each shard contains (indexed by source rev):
   list of numbers of forward copies per respective revision,
   list of source_pathIDs,
   list of target_pathIDs,
   list of target_revs (as delta to the implicit source_rev)
(VI) Storage details

* unsigned integers are variable-length encoded (see svndiff.c)
* signed integers are variable-length encoded after mapping
   them onto unsigned ints: sint x -> 2 abs(x) + sign(x)
* data has been arranged such that efficient data compression
   is possible (details not in this post)
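A sketch of the two integer encodings in Python (the LEB128-style byte
order here is an assumption for illustration; svndiff.c remains the
authoritative reference for the actual wire format):

```python
def encode_uint(n):
    """Variable-length encode an unsigned int: 7 bits per byte,
    continuation bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def decode_uint(data, pos=0):
    """Inverse of encode_uint; returns (value, next position)."""
    n, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        n |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return n, pos

def encode_sint(x):
    """Map a signed int onto an unsigned one as described above:
    x -> 2 * abs(x) + sign(x), where sign(x) is 1 for negatives."""
    return encode_uint(2 * abs(x) + (1 if x < 0 else 0))
```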

-- Stefan^2.


Re: FSFS successor ID design draft

Posted by Daniel Shahaf <da...@elego.de>.
Stefan Sperling wrote on Mon, Aug 29, 2011 at 18:01:38 +0200:
> On Sat, Aug 27, 2011 at 02:15:31AM +0300, Daniel Shahaf wrote:
> > >  - cache can be regenerated on demand
> > 
> > Regenerated *offline*, right?  ie, if the cache is lost it can be
> > derived from the revision files (which remain authoritative), but this
> > process need not withstand concurrent readers.  (The problem being the
> > need to decrement the cached-rev file)
> 
> Well, decrementing the cached-rev file would have to be done offline.
> But the cache can be rebuilt while the repository is active.
> 
> We could have a flag that says whether the cache is active.
> E.g. after an upgrade from an older repository format, the cache
> would be flagged inactive, until it has been populated.
> 
> When rebuilding the cache we'd take the repository offline,
> mark the cache inactive, remove the cached-rev file, go online,
> and rebuild the cache. Readers will get information for older
> revisions only until the cache has been rebuilt.
> 

We mentioned a format number for the cache files; this "active flag"
almost smells like a format number for the entire cache... (with
format 0 being "inactive")

> > > The index size and offsets are 64 bit values.
> > 
> > Stored as?  ASCII decimal?  (If so: how is it delimited or padded?)  Big
> > endian binary?
> 
> Some binary format. Fixed size.
> 

Fair enough.

> > Q: How do you update the 'offset of free space' in a manner that is
> > consistent to readers?
> > 
> > ie, if the offset was ASCII ' 800' and the new offset is '1600', you
> > must avoid a reader accidentally seeing '1800'.
> 
> The offsets appear at the start of a disk block.
> They should be a fixed-sized binary value -- ASCII has problems
> as you've pointed out.
> 
> In any case, I am assuming that flushing data to disk flushes one disk
> block. I.e. we can update one block atomically during flush.
> Maybe this assumption isn't valid.
> 

ack

> > (I seem to remember some ideas exchanged on the revprop-packing design
> > discussion --- including one that involved having two offsets written
> > down in the file, and one of them *always* being valid (but not always
> > the same one) --- perhaps Peter will remember...)
> 
> Do you have a link to the archives?
> 

Not exactly, sorry.  Most discussions happened in this (huge) thread:

    From: kmradke@rockwellcollins.com
    To: dev@subversion.apache.org
    Subject: Does fsfs revprop packing no longer allow usage of traditional backup
            software?
    Message-ID: <OF...@rockwellcollins.com>

The crux of the idea, though, is to have two byte locations for the
offset --- say, store an integer at byte 4-7 and another at bytes 8-11
--- and have some logic to choose which one of those to read and treat
as the offset.  In the meantime the other integer can be updated
arbitrarily.
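One way to realize that idea (a sketch only; the generation counters
and slot layout below are my assumptions, not something settled in
that thread) is to pair each offset with a generation number, always
overwrite the stale slot, and have readers trust the newer one:

```python
import struct

# Hypothetical double-slot header: two (generation, offset) pairs at
# fixed byte positions.  The writer only ever modifies the slot with
# the *older* generation, so a concurrent reader always finds at
# least one intact (generation, offset) pair.
SLOT_FMT = '>II'                    # (generation, offset), big-endian
SLOT_SIZE = struct.calcsize(SLOT_FMT)

def read_offset(header):
    g0, off0 = struct.unpack_from(SLOT_FMT, header, 0)
    g1, off1 = struct.unpack_from(SLOT_FMT, header, SLOT_SIZE)
    return off0 if g0 > g1 else off1

def write_offset(header, new_offset):
    g0, _ = struct.unpack_from(SLOT_FMT, header, 0)
    g1, _ = struct.unpack_from(SLOT_FMT, header, SLOT_SIZE)
    # overwrite the stale slot, bumping its generation past the other
    if g0 > g1:
        struct.pack_into(SLOT_FMT, header, SLOT_SIZE, g0 + 1, new_offset)
    else:
        struct.pack_into(SLOT_FMT, header, 0, g1 + 1, new_offset)
```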

> > So, invariant: given a node-rev, all its s-blocks, except perhaps the
> > last (in order of appearance in the file), have at most <size of the
> > longest possible node-rev-id, minus one> free (zero) bytes at the end.
> 
> Did you mean "at least"?
>  

No.  I said that "All completed s-blocks have at most O(1) free bytes
at the end".

> > > It updates the 'next s-block' offset in the previous s-block to the
> > > offset of the new s-block, and flushes the cache file to disk.
> > > 
> > 
> > At what point does the writer update the cached-rev file?
> 
> The writer doesn't update the cached-rev file. This is done by
> the global cache sync code.
> 
> I forgot to include two steps for the writer, though.
> It needs to update the offset to the last s-block in the index
> 

You forgot to include "two steps", but you only list one step here.  Did
you forget to list the other step which you had originally forgotten to
list?

> > So, a lost block is in the file.  You could avoid that by seeking, not
> > to the end of the file but to 
> > 
> > SELECT (MAX(offset to my last s-block) + SIZEOF(s-block)) FROM index
> > 
> > (and, yes, I realize that this is a crash situation. two answers to
> > that: first, it's easy to avoid having an unused block; second, if
> > a file grows an unused block then a DOS can translate to running out of
> > disk space.)
> 
> I don't really understand you here.
> Are you suggesting that the code should check for a dead s-block
> first, before appending to the file?
> 

I'd like to avoid growing 'holes' in the middle of the file that readers
will never access.  (I don't want to require dump|load in order to lose
such holes.)

> That MAX doesn't make sense. Each node_rev_id only appears once
> in the index. Maybe that wasn't clear yet?  There is exactly one entry
> for each node-rev-id from revision N in the cache file for revision N.
>  

Clear.

> > >     for node_rev in $rev {
> > >       obtain predecessor p_node_rev of $node_rev
> > >       obtain revision p_rev from $p_node_rev
> > 
> > Should the cache note which of the successors is the M-successor (the
> > one with the same copy-id)?  Perhaps by stipulating that the M-successor
> > is always the first in the list of successors, and leaving room for it
> > if a C-successor (a copy) is created before the M-successor.
> 
> We don't know how long the M-successor ID is going to be.
> 

Actually we can bound it, but the bound I have in mind is going to be
a large overestimate for the common case.

> If we want to optimise locating the M-successor, we could add the
> offset to the s-block which contains the M-successor to the index,
> and use a special flag byte to terminate the M-successor string.
> 

This is probably going to be more space-efficient, yes.

> > (It does mean we have two codepaths though, eg both packed and
> > non-packed shards; and <cue wild idea> perhaps it'll be easier to have
> > *only* packed shards, and to use the expensive replay method to answer
> > queries about revisions younger than the latest pack file?  (We may want
> > to use a smaller shard size in this case.)
> > 
> > And, by the way, this idea isn't /so/ unwieldy.  I think that you can
> > use *only* pack files --- and also use pack files for unfinished shards
> > --- if, when you run out of INDEX_SIZE, you rewrite the whole file and
> > move-into-place a new pack file (with a doubled INDEX_SIZE) on top of
> > the old pack file.  The advantage here would be not having the
> > both-packed-and-nonpacked-shards headache to deal with.)
> 
> Might be possible. Not sure if that's worth it.
> 
> > >       write $rev into 'cached-rev' file
> > >       flush 'cached-rev' file
> > 
> > No, you don't flush, you move-into-place.
> 
> I decided to avoid move-into-place on purpose.
> 
> This is my biggest question:
> 
> Does FSFS presently move stuff into place if the move target location
> already exists? I am asking because I don't think this would work on
> Windows. We cannot overwrite a file if a reader has opened it.
> Or do we really use the delete-retry-loop on Windows servers?
> 
> The only situation I can think of is where a previous commit created
> a revision file and didn't update 'current'. In which case it's unlikely
> that the revision file will be open when the next commits moves a file
> with the same name into place.
> 

As I said on IRC: look into revprop editing in 1.6 and 1.7 FSFS.

(and probably the locks/ hierarchy, too; see fsfs/lock.c)

> > Okay, so the cache reading procedure MUST change, to account for the
> > non-packed cache file vanishing (by a concurrent packer) and the pack
> > file appearing instead.  This is another layer of headache on top of the
> > base one :-(
> 
> Do we support packing while the repository is live?

Yes.

> I hadn't considered this.
> 
> Thanks for your feedback!

Welcome.

Re: FSFS successor ID design draft

Posted by Stefan Sperling <st...@elego.de>.
On Sat, Aug 27, 2011 at 02:15:31AM +0300, Daniel Shahaf wrote:
> Add it as sibling of 'structure' on the branch?  That's its ultimate home.

Fair enough.

> >  - cache can be regenerated on demand
> 
> Regenerated *offline*, right?  ie, if the cache is lost it can be
> derived from the revision files (which remain authoritative), but this
> process need not withstand concurrent readers.  (The problem being the
> need to decrement the cached-rev file)

Well, decrementing the cached-rev file would have to be done offline.
But the cache can be rebuilt while the repository is active.

We could have a flag that says whether the cache is active.
E.g. after an upgrade from an older repository format, the cache
would be flagged inactive, until it has been populated.

When rebuilding the cache we'd take the repository offline,
mark the cache inactive, remove the cached-rev file, go online,
and rebuild the cache. Readers will get information for older
revisions only until the cache has been rebuilt.

> >   More precisely, the amount of successors listed in the cache is
> >   equal to the number of node revisions in the repository, minus
> >   the number of nodes in HEAD (which don't presently have successors)
> >   and minus the number of nodes which were deleted in history (they
> >   don't have successors either).
> >   
> 
> Do you mean "nodes" or "noderevs"?

This should say "node-revs in HEAD".

> Why are you subtracting the number of deletes?

Oh. Well, that part is wrong and should be removed.

> > The index size and offsets are 64 bit values.
> 
> Stored as?  ASCII decimal?  (If so: how is it delimited or padded?)  Big
> endian binary?

Some binary format. Fixed size.

> >  +----------------------------------------------------------------------+
> >  |                           index size                                 |
> >  +----------------------------------------------------------------------+
> >  | node_rev_id | offset to my first s-block | offset to my last s-block |
> >  | node_rev_id | offset to my first s-block | offset to my last s-block |
> >  | node_rev_id | offset to my first s-block | offset to my last s-block |
> 
> Just to clarify: both offsets are to the *first* byte of the s-block,
> correct?

Yes.

> Are the offset relative to the file or to the end of the index?

Haven't made up my mind about that yet.

For this first draft, consider all offsets relative to the start
of the cache file. However, we can still change this.

> >  |                               ...                                    |    
> >  +----------------------------------------------------------------------+
> >  |               zero padding to multiple of s-block size               |
> >  +----------------------------------------------------------------------+
> 
> 
> You pad [index size, array_of[noderevid offset offset]] rather than
> just array_of[noderevid offset offset].  Correct?

Yes, the entire index fits into a multiple of the block size.

> What is the size of an s-block?

A constant. See later in the text.

> >  +----------------------------------------------------------------------+
> >  |                    offset of next s-block                            |
> 
> Perhaps store the length of this s-block?  That seems more robust to me.

The s-block length is constant.

> >  +----------------------------------------------------------------------+
> >  | successor_id, successor_id, ...                                      |
> 
> What is the separator between successor_id's?
>
> (Perhaps it's 0x2C 0x20, but that's not quite clear from the text.)

Some value, whatever. E.g. '\0'.

> > 8192 bytes will provide enough space for at least 128 successors of a
> > node-rev-id. Which means we'll have to allocate a new s-block about every
> > 128 successors
> 
> 128 is the wrong figure, you overlooked the two offsets at the top of
> the s-block.

Yes, it's just an approximation.

> > (usually a slightly higher amount of successors should fit).
> 
> Indeed, assuming that the successor_id's list has variable-sized
> entries.  (ie, that the successors' noderev id's are not padded.)

Yes, they are variable size.

> > If more successors must fit SBLOCK_SIZE should be some multiple of 8192.
> > 
> 
> Multiple of 2**13?  Just plain old 'power of 2' isn't good enough?

Not if we want the blocks to align with disk blocks.

> [ And how does the higher-level API deal with a situation where the
> cache is a far cry behind HEAD?  It clearly shouldn't do a full history
> crawl by default, and there ought to be numerous ways to achieve that. ]

This is tricky. We need some heuristic.

Though Stefan^2 points out that rebuilding the cache might not be as
expensive as we thought it would be.
 
> Oh, so the s-blocks for a given noderev might not all be contiguous?

Exactly.

> Q: How do you update the 'offset of free space' in a manner that is
> consistent to readers?
> 
> ie, if the offset was ASCII ' 800' and the new offset is '1600', you
> must avoid a reader accidentally seeing '1800'.

The offsets appear at the start of a disk block.
They should be a fixed-sized binary value -- ASCII has problems
as you've pointed out.

In any case, I am assuming that flushing data to disk flushes one disk
block. I.e. we can update one block atomically during flush.
Maybe this assumption isn't valid.

> (I seem to remember some ideas exchanged on the revprop-packing design
> discussion --- including one that involved having two offsets written
> down in the file, and one of them *always* being valid (but not always
> the same one) --- perhaps Peter will remember...)

Do you have a link to the archives?

> So, invariant: given a node-rev, all its s-blocks, except perhaps the
> last (in order of appearance in the file), have at most <size of the
> longest possible node-rev-id, minus one> free (zero) bytes at the end.

Did you mean "at least"?
 
> > It updates the 'next s-block' offset in the previous s-block to the
> > offset of the new s-block, and flushes the cache file to disk.
> > 
> 
> At what point does the writer update the cached-rev file?

The writer doesn't update the cached-rev file. This is done by
the global cache sync code.

I forgot to include two steps for the writer, though.
It needs to update the offset to the last s-block in the index

> So, a lost block is in the file.  You could avoid that by seeking, not
> to the end of the file but to 
> 
> SELECT (MAX(offset to my last s-block) + SIZEOF(s-block)) FROM index
> 
> (and, yes, I realize that this is a crash situation. two answers to
> that: first, it's easy to avoid having an unused block; second, if
> a file grows an unused block then a DOS can translate to running out of
> disk space.)

I don't really understand you here.
Are you suggesting that the code should check for a dead s-block
first, before appending to the file?

That MAX doesn't make sense. Each node_rev_id only appears once
in the index. Maybe that wasn't clear yet?  There is exactly one entry
for each node-rev-id from revision N in the cache file for revision N.
 
> >     for node_rev in $rev {
> >       obtain predecessor p_node_rev of $node_rev
> >       obtain revision p_rev from $p_node_rev
> 
> Should the cache note which of the successors is the M-successor (the
> one with the same copy-id)?  Perhaps by stipulating that the M-successor
> is always the first in the list of successors, and leaving room for it
> if a C-successor (a copy) is created before the M-successor.

We don't know how long the M-successor ID is going to be.

If we want to optimise locating the M-successor, we could add the
offset to the s-block which contains the M-successor to the index,
and use a special flag byte to terminate the M-successor string.

> (It does mean we have two codepaths though, eg both packed and
> non-packed shards; and <cue wild idea> perhaps it'll be easier to have
> *only* packed shards, and to use the expensive replay method to answer
> queries about revisions younger than the latest pack file?  (We may want
> to use a smaller shard size in this case.)
> 
> And, by the way, this idea isn't /so/ unwieldy.  I think that you can
> use *only* pack files --- and also use pack files for unfinished shards
> --- if, when you run out of INDEX_SIZE, you rewrite the whole file and
> move-into-place a new pack file (with a doubled INDEX_SIZE) on top of
> the old pack file.  The advantage here would be not having the
> both-packed-and-nonpacked-shards headache to deal with.)

Might be possible. Not sure if that's worth it.

> >       write $rev into 'cached-rev' file
> >       flush 'cached-rev' file
> 
> No, you don't flush, you move-into-place.

I decided to avoid move-into-place on purpose.

This is my biggest question:

Does FSFS presently move stuff into place if the move target location
already exists? I am asking because I don't think this would work on
Windows. We cannot overwrite a file if a reader has opened it.
Or do we really use the delete-retry-loop on Windows servers?

The only situation I can think of is where a previous commit created
a revision file and didn't update 'current'. In which case it's unlikely
that the revision file will be open when the next commits moves a file
with the same name into place.

> Okay, so the cache reading procedure MUST change, to account for the
> non-packed cache file vanishing (by a concurrent packer) and the pack
> file appearing instead.  This is another layer of headache on top of the
> base one :-(

Do we support packing while the repository is live?
I hadn't considered this.

Thanks for your feedback!

Re: FSFS successor ID design draft

Posted by Daniel Shahaf <da...@elego.de>.
[ Added after reading through and replying to everything: overall this
seems good in a detail-by-detail reading[1]; please commit. ]

[1] the 'mile-high reading' I didn't do, but I think the last two days
of IRC discussions are a fine replacement to it.

Stefan Sperling wrote on Fri, Aug 26, 2011 at 13:14:32 +0200:
> Below is an initial draft of a design for successor-IDs in FSFS.
> It is a bit long, but I hope it covers all relevant points in
> sufficient detail.
> 
> Please share your thoughts on this. I have most likely missed a few things.
> 
> If we like it I will put it into the repository for further editing,
> either in notes/ on trunk or onto the fsfs-successor ID branch.
> 

Add it as sibling of 'structure' on the branch?  That's its ultimate home.

> If this design has some severe problem I don't mind going back to
> the drawing board. This is my first shot, afterall :)
> 
> cmpilato, does this significantly differ from what you've done for BDB?
> Will both backends have the same (or similar) behaviour if we use
> this design for FSFS?
> 
> Thanks to neels and danielsh for all your input so far.
> You've been tremendously helpful!
> 
> I would also be happy to see alternative proposals.
> 
> 
> 
> FSFS successor IDs
> ==================
> 
> See http://svn.apache.org/repos/asf/subversion/branches/fs-successor-ids/BRANCH-README
> for what this is all about.
> 
> What follows are ideas about implementing successor ID support for FSFS.
> 
> Useful background information:
>   Node-revisions: subversion/libsvn_fs_base/notes/fs-history
>                   subversion/libsvn_fs_base/notes/structure
>   fsfs design: subversion/libsvn_fs_fs/structure
> 
> 
> The "cache" and its properties
> ==============================
> 
> Storage of successor IDs is essentially a cache of the result of the
> inversion of the predecessor relation between all node-revisions.
> This inversion is a *very* expensive operation (follow every node
> that ever existed in the repository from its most recent revision
> back to the revision it was created in).
> 
> In the following, "cache" refers to the store for successor IDs.
> 
> After a general introduction, a proposed cache design is presented.
> Note that the cache implementation will be hidden behind the API of
> libsvn_fs_fs and can be improved or changed in every new minor
> release of Subversion.
> 
> Required properties of successor ID cache for FSFS:
>  - FSFS-compatible:
>    - cache must use some plain file format
>    - writers do not block readers (i.e. sqlite is out of the question)
>    - cache must not rely on modifying existing revision files
>  - cache can be regenerated on demand

Regenerated *offline*, right?  ie, if the cache is lost it can be
derived from the revision files (which remain authoritative), but this
process need not withstand concurrent readers.  (The problem being the
need to decrement the cached-rev file)

Another point:
   - cache readers know how complete the data they read is

>  - do not corrupt the cache in case of an unexpected crash
>  - try to keep number of directory entries low
>  - conserve inodes by design (preferred), or via packing support
> 
> Nice-to-have properties:
>  - the repository can be live during cache generation

   - cache can be constructed incrementally
     (ie, amortize the construction among many readers/writers, rather
     than require a full history crawl to be done all at once)
> 
> 
> Storage size requirements
> =========================
> 
> Ballpark figures about node-revs from #svn-dev:
>   * cmpilato does some math.  There are over 350,000 node revisions in my
>       ancient copy of the svn.collab.net Subversion repos.
>   <hwright> there are over 20M of them in the ASF repo
> 
> How many successors does a given node-rev have?
> 
>   There is at most one successor on the same line of history (same copy_id).
>   Each copy operation also creates one new successor.
> 
> How big will the cache become, in total?
> 
>   The successor cache does not store entire node-revs, just node-rev IDs.
>   These are generally not longer than 40 bytes (less than half a line
> in an 80x25 terminal being the rough estimate).
> 
>   The amount of raw information to store per node-revision is the
>   size of this mapping:
>     node_rev_id => successor_id
> 
>   Each entry is at least 80 bytes large, 40 bytes for the node_rev_id
>   and 40 bytes for the successor ID.
> 
>   Each new node-revision adds one entry to the map.
>   Since each successor is itself a node-revision, the entire cache
>   has at most as many entries as the amount of node-revs in the
>   repository.
> 

(Technically you can say "strictly fewer" entries here, since r0 always
contains a noderev that is not the successor of anything.)

>   More precisely, the amount of successors listed in the cache is
>   equal to the number of node revisions in the repository, minus
>   the number of nodes in HEAD (which don't presently have successors)
>   and minus the number of nodes which were deleted in history (they
>   don't have successors either).
>   

Do you mean "nodes" or "noderevs"?

Why are you subtracting the number of deletes?

>   In other words, the amount of successors in the cache is equal to
>   the amount of pred: entries in revision files.
> 
>   Let's try to plug in some numbers from the above svn.collab.net case.
> 
>   There are 350000 node revs. 
>     maximum size of successor IDs in cache = 40 bytes * 350000
>     maximum size of node-rev IDs in cache = 40 bytes * 350000
>     maximum size of cache = 40 bytes * 350000 * 2
> 
>   This amounts to roughly 28MB worst-case successor cache size for
>   the svn.collab.net repository (which itself is about 500MB in size).
> 
> 
> Implementation proposal
> =======================
> 
> 
> - On-disk File Format -
> 
> All files related to the cache are stored in the
> repos/db/successor-id-cache directory ("cache directory").
> 
> There is one file which contains the number of the revision the
> cache was last synced to: repos/db/successor-id-cache/cached-rev
> This is used by both readers and writers.
> 
> The directory also contains files named after revisions,
> and is sharded modulo some value TBD:
> 
>  successor-id-cache/0/1
>  successor-id-cache/0/2
>  successor-id-cache/0/3
>  ...
>  successor-id-cache/1/N
>  successor-id-cache/1/M
>  ...
> 
> These individual files store successors of all node-revs of a given revision
> (or, optionally, a number of revisions -- see notes on packing below).
> 
> Each file starts with an index, followed by zero or more successors-blocks
> ("s-block" for short).
> 
> The format of the index is shown below. It contains the index size
> (which is a fixed value) and one entry for each node_rev that appears
> in the revision(s) the file is responsible for.
> 
> The index size and offsets are 64 bit values.

Stored as?  ASCII decimal?  (If so: how is it delimited or padded?)  Big
endian binary?

> (32 bit index size would probably be sufficient, but let's not count peanuts.)
> node_rev_id is a variable length ASCII string terminated by '\0'.
> 

<bikeshed> s/\0/\n/? </bikeshed>

(That's consistent with the use of noderev id's in noderev headers in
the revision files; a header line is delimited by \n.)

>  +----------------------------------------------------------------------+
>  |                           index size                                 |
>  +----------------------------------------------------------------------+
>  | node_rev_id | offset to my first s-block | offset to my last s-block |
>  | node_rev_id | offset to my first s-block | offset to my last s-block |
>  | node_rev_id | offset to my first s-block | offset to my last s-block |

Just to clarify: both offsets are to the *first* byte of the s-block,
correct?

Are the offsets relative to the file or to the end of the index?

>  |                               ...                                    |    
>  +----------------------------------------------------------------------+
>  |               zero padding to multiple of s-block size               |
>  +----------------------------------------------------------------------+


You pad [index size, array_of[noderevid offset offset]] rather than
just array_of[noderevid offset offset].  Correct?

(ie, sizeof(index size) + sizeof(array) + sizeof(padding) is what has to
be a power of 2, rather than only the sum of the last two addends.)

What is the size of an s-block?
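For concreteness, here is how reading the index could look. This is a sketch
only: the draft leaves the binary serialization open, so big-endian 64-bit
integers are an assumption here; the '\0'-terminated ASCII node-rev ids are
per the draft, and index_size is taken to mean the total size of the index
region, padding included.

```python
import struct

def parse_index(data):
    # Sketch of reading the proposed index. Serialization is assumed:
    # big-endian 64-bit integers (the draft doesn't specify), node-rev
    # ids '\0'-terminated ASCII (as the draft says), index_size taken
    # as the total index region size, zero padding included.
    index_size = struct.unpack_from(">Q", data, 0)[0]
    entries = {}
    pos = 8
    while pos < index_size:
        if data[pos] == 0:                 # hit the zero padding
            break
        end = data.index(b"\0", pos)
        node_rev_id = data[pos:end].decode("ascii")
        first, last = struct.unpack_from(">QQ", data, end + 1)
        entries[node_rev_id] = (first, last)
        pos = end + 1 + 16                 # id + NUL + two offsets
    return entries
```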

> 
> If a node-rev does not have any successors, both s-block offsets are zero.
> This makes finding the answer to the question "does this node have any
> successors?" an O(1) operation (only one disk block needs to be read

[[colon] the index block]

> ).
> 
> If none of the node-revs have successors yet, the file ends here.
> 
> If a node-rev has successors, the file contains one or more s-blocks.

Corollary: the cache is updated after 'current' has been bumped,
finishing a commit.

> An s-block lists

[the noderev id's of ]

> successors of the node-rev and provides some amount of
> pre-allocated space for future successors. The offset to the first s-block
> helps readers, the offset to the last s-block helps writers (the next
> section provides details).
> 
> The padding at the end of the index ensures that all s-block boundaries
> lie on disk sector boundaries (more details later).
> 
> 
> The format of an s-block is shown below. It has a fixed size (SBLOCK_SIZE).
> The offsets are 64 bit values. successor_id is a variable length ASCII
> string terminated by '\0'.
> 

(same serialization questions as above)

>  +----------------------------------------------------------------------+
>  |                    offset of next s-block                            |

Perhaps store the length of this s-block?  That seems more robust to me.

>  +----------------------------------------------------------------------+
>  |             offset of free space in this s-block                     |

Offset relative to what?

(Please always say what offsets are relative to... that's as needed as
giving the units of measured quantities.)

>  +----------------------------------------------------------------------+
>  | successor_id, successor_id, ...                                      |

What is the separator between successor_id's?

(Perhaps it's 0x2C 0x20, but that's not quite clear from the text.)

>  +----------------------------------------------------------------------+
>  |              free space zero-padded to SBLOCK_SIZE                   |
>  +----------------------------------------------------------------------+
> 
> What is a good value for SBLOCK_SIZE?
> Modern disks have a block size of 8192 bytes (some disks even use this
> block size internally but report 512 bytes of block size to the OS).
> Using this block size means that each s-block will fit into one sector
> on modern disks, and into 16 sectors on older disks using 512 byte sectors.
> In both cases an s-block can be read in a single sequential read.
> 
> Recall that a successor_id will take about 40 bytes. Let's round that
> up to 64 for good measure. Applying advanced mathematics tells us that

(including any commas; per the above question)

> 8192 bytes will provide enough space for at least 128 successors of a
> node-rev-id. Which means we'll have to allocate a new s-block about every
> 128 successors

128 is the wrong figure: you overlooked the two offsets at the top of
the s-block.

> (usually a slightly higher number of successors should fit).

Indeed, assuming that the successor_id's list has variable-sized
entries.  (ie, that the successors' noderev id's are not padded.)
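The capacity arithmetic, with the two header offsets accounted for (using the
draft's own 64-byte rounding):

```python
SBLOCK_SIZE = 8192           # one modern disk block, per the draft
HEADER_SIZE = 2 * 8          # 'next s-block' + 'free space' offsets
ENTRY_SIZE = 64              # successor id rounded up, per the draft

capacity = (SBLOCK_SIZE - HEADER_SIZE) // ENTRY_SIZE
print(capacity)              # 127
```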

> If more successors must fit SBLOCK_SIZE should be some multiple of 8192.
> 

Multiple of 2**13?  Just plain old 'power of 2' isn't good enough?

> If this s-block is the last s-block for this node-rev, the offset of
> the next s-block is zero.
> 
> The offset of free space within the s-block is used by readers when
> reading existing successors and by writers when appending new successors.
> (Note that in the implementation the offset of free space might also
> be a 16 bit number relative to the start of the s-block, instead of
> being a 64 bit number. But let's ignore such optimisations for now.) 
> 
> 
> 
> - Procedure for Readers -
> 
> A reader first reads the cached_rev file and remembers the revision number.
> 
> It then locates the cache file corresponding to the given node_rev_id.
> The name of the file is keyed on the revision of the node-rev. (Without
> packing the file has the same name as the revision. With packing the
> file is named after this revision modulo some fixed and known value.)
> 

Point being: filename = deterministic function(revnum)
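Such a function might look like the following. Both constants are placeholders
(the shard size is explicitly "TBD" in the draft, and the pack size is
hypothetical), and "modulo some fixed and known value" is read here as
rounding down to the start of the pack group, which is an assumption rather
than settled text:

```python
SHARD_SIZE = 1000    # "TBD" in the draft; placeholder value
PACK_SIZE = 64       # hypothetical pack size, also unsettled

def cache_file_for(rev, packed=False):
    # Without packing, the file is simply named after the revision.
    # With packing, this sketch names it after the first revision of
    # its pack group -- one possible reading of the draft's "modulo".
    name = rev - (rev % PACK_SIZE) if packed else rev
    return "successor-id-cache/%d/%d" % (rev // SHARD_SIZE, name)
```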

> It opens this file and reads the index to learn the offset of the
> first s-block of the node_rev_id.
> 

[ I point out later, with packing, things aren't that simple, as the
function isn't deterministic; it's two-valued. ]

> If the offset is zero, the node-rev has no successors, and the search ends.
> 
> The reader seeks to the offset of the first s-block.
> 
> It reads the offset of free space within the s-block, and then reads
> successors from the s-block up to this offset.
> 
> If the offset to the next s-block is zero, the search ends.
> If the offset to the next s-block is non-zero, the reader
> proceeds to the next s-block.
> 

I suggested earlier that "next s-block" offsets be relative.  But if
they're absolute, I'd suggest here to check that they are later in the
file (to protect against infinite loops).

> When the search ends, the reader returns the list of successors
> and the number of the cached_rev, which indicates up to which
> revision the successor list is valid. (This is a lower bound; the
> successor list might already contain some values for newer revisions
> if a writer is concurrently updating the cache. However, there is no
> harm done if the reader uses this information because it pertains to
> already committed revisions.)
> 

Repeating for the list what I mentioned on IRC:

I envision that, in addition to this low level API that returns the
results up to some revision which it also returns, we'll have a wrapper
API (in fs-loader.c?) that, given a revnum rN, (a) calls the low-level
API, obtaining the successors up to some rM; and (b) uses other APIs ---
svn_fs_replay()? --- to compute (expensively) the successors from rM+1
to rN.

[ And how does the higher-level API deal with a situation where the
cache is a far cry behind HEAD?  It clearly shouldn't do a full history
crawl by default, and there ought to be numerous ways to achieve that. ]

> Note that there are only O(number of s-blocks) disk seeks involved.
> Any time data is read from disk one or more disk blocks are read sequentially.
> 

Oh, so the s-blocks for a given noderev might not all be contiguous?
Okay.
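Putting the reader steps together, a sketch (serialization guesses again:
offsets are absolute file positions stored as big-endian 64-bit integers, the
free-space offset is relative to the start of its s-block, and ids are
'\0'-separated; the forward-only check addresses the infinite-loop concern):

```python
import struct

SBLOCK_SIZE = 8192

def read_successors(f, first_sblock_offset):
    # Walk the s-block chain of one node-rev. Assumptions: absolute
    # big-endian 64-bit offsets, free-space offset relative to the
    # s-block start, '\0'-terminated successor ids.
    successors = []
    offset = first_sblock_offset
    while offset != 0:
        f.seek(offset)
        sblock = f.read(SBLOCK_SIZE)
        next_offset, free_offset = struct.unpack_from(">QQ", sblock, 0)
        raw = sblock[16:free_offset]   # ids between header and free space
        successors.extend(s.decode("ascii")
                          for s in raw.split(b"\0") if s)
        if next_offset != 0 and next_offset <= offset:
            # guard against a corrupt chain looping forever
            raise IOError("s-block chain does not move forward")
        offset = next_offset
    return successors
```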

> 
> 
> - Procedure for Writers -
> 
> Note that the following describes how successor data for a single
> node_rev_id is updated. This is only a sub-step of the global cache
> synchronisation procedure (which is described in the next section).
> 
> A writer operates under the assumption that the entire cache directory
> has been locked for writing.
> 
> The writer locates the cache file corresponding to the given node_rev_id.
> 
> The writer opens this file and reads the index to learn the offset of the
> last s-block of the node_rev_id.
> 
> The writer seeks to the offset of the last s-block.
> 
> The writer reads the offset of remaining free space within the s-block,
> and calculates the amount of free space left within the s-block (recall
> that the s-block has fixed size SBLOCK_SIZE).
> 
> The writer writes successor IDs to free space within the s-block,

Q: How do you perform the writes in a manner safe to concurrent readers?

A: using the 'offset of free space', below.

Q: How do you update the 'offset of free space' in a manner that is
consistent to readers?

ie, if the offset was ASCII ' 800' and the new offset is '1600', you
must avoid a reader accidentally seeing '1800'.

(I seem to remember some ideas exchanged on the revprop-packing design
discussion --- including one that involved having two offsets written
down in the file, and one of them *always* being valid (but not always
the same one) --- perhaps Peter will remember...)

> and flushes the cache file to disk.
> 
> The writer updates the offset of free space in the s-block, and
> flushes the cache file to disk.
> 
> (At this point readers get to see the new successor IDs.)
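The write-then-publish ordering described above can be sketched as follows
(same serialization assumptions as before; allocating a new s-block when
space runs out is omitted):

```python
import struct

SBLOCK_SIZE = 8192

def append_successors(f, sblock_offset, new_ids):
    # Append successor ids to a node-rev's last s-block, preserving
    # the draft's ordering: write data, flush, then publish by bumping
    # the free-space offset, flush again. Readers keep stopping at the
    # old free-space offset until step 2, so they never see torn data.
    _next, free_offset = struct.unpack(">QQ", (f.seek(sblock_offset), f.read(16))[1])
    payload = b"".join(i.encode("ascii") + b"\0" for i in new_ids)
    assert free_offset + len(payload) <= SBLOCK_SIZE, "new s-block needed"
    # 1. write into free space; not yet visible to readers
    f.seek(sblock_offset + free_offset)
    f.write(payload)
    f.flush()
    # 2. only now make the new ids visible
    f.seek(sblock_offset + 8)
    f.write(struct.pack(">Q", free_offset + len(payload)))
    f.flush()
```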
> 
> 
> At any point, if remaining free space is too small for all new successors,
> the writer must allocate a new s-block.
> 

So, invariant: given a node-rev, all its s-blocks, except perhaps the
last (in order of appearance in the file), have at most <size of the
longest possible node-rev-id, minus one> free (zero) bytes at the end.

> To do so, it seeks to the very end of the file and appends a new s-block
> which initially contains only zeros.
> 
> The writer writes successors to free space in the new s-block.
> It updates the offset to free space in the new s-block, and flushes
> the cache file to disk.
> 

(If the new s-block runs out, loop and allocate yet another new
s-block.)

> It updates the 'next s-block' offset in the previous s-block to the
> offset of the new s-block, and flushes the cache file to disk.
> 

At what point does the writer update the cached-rev file?

> 
> If the writer crashes, the following cases can happen:
> 
>   - New successors have been added to an s-block, but the offset to
>     free space has not been updated.
>     => Newly added successors will be overwritten by the next writer,
>        and readers don't see them.
>   - A new s-block has been written, but the 'next s-block' offset at
>     the previous s-block has not been updated.
>     => An unused s-block will remain in the file.
>        The next writer adding successors of the same node-rev will
>        append a new s-block to the file. Readers don't see the unused s-block.
>        If the cache is rebuilt from scratch the unused s-block disappears.
> 

So, a lost block is in the file.  You could avoid that by seeking, not
to the end of the file but to 

SELECT (MAX(offset to my last s-block) + SIZEOF(s-block)) FROM index

(and, yes, I realize that this is a crash situation. two answers to
that: first, it's easy to avoid having an unused block; second, if
a file grows an unused block then a DOS can translate to running out of
disk space.)

> Thus, cache consistency is guaranteed even in case of crashes.
> 
> 
> 
> - Cache Synchronisation -
> 
> The procedure for synchronising the cache is as follows.
> Note that it assumes a write lock on the cache is held.
> 
> This procedure is run during post-commit processing after 'current'
> has been updated. It can also be run manually via svnadmin.
> 

(it should be run manually if the procedure failed to run during
post-commit FS processing for any reason)

> sync_successor_id_cache() {
>   open 'cached-rev' file
>   read $cached_rev from 'cached-rev' file
> 
>   open 'current' file
>   read $new_cached_rev from 'current' file
>   close 'current' file
> 

/me notes the order the two files are read

>   # The new_successors hash table is keyed by a node_rev_id.
>   # Its values are lists of successor IDs.
>   #
>   # The on-disk cache file format is designed for batch updates.
>   # It is inefficient to add new successors to cache files one by one.
>   # To update cache files efficiently, this table will grow up to a certain
>   # upper bound and be flushed to disk periodically.
>   new_successors = {}
> 
>   for rev in [$cached_rev, ..., $new_cached_rev] {
> 

for rev in [$cached_rev+1, ..., $new_cached_rev] {

>     for node_rev in $rev {
>       obtain predecessor p_node_rev of $node_rev
>       obtain revision p_rev from $p_node_rev

Should the cache note which of the successors is the M-successor (the
one with the same copy-id)?  Perhaps by stipulating that the M-successor
is always the first in the list of successors, and leaving room for it
if a C-successor (a copy) is created before the M-successor.

>       if cache file $p_rev exists {
> 
>         # this uses the 'reader' procedure as explained above
> 	look up $p_node_rev:$node_rev in cache file
> 
>         if $p_node_rev:$node_rev is already listed
> 	  continue with next node_rev
>         else
> 	  new_successors[$p_node_rev] += $node_rev
>       } else {
> 	new_successors[$p_node_rev] += $node_rev
>       }
>     }
> 
>     if space required by new_successors reached upper bound {

or if we are on the last iteration on the loop
(to avoid the code duplication later)

>       for node_rev in new_successors.keys() {
>         # update_cache_file() implements the 'writer' procedure explained above
>         update_cache_file($node_rev, new_successors[$node_rev])
>       }

So, you scan from (cached_rev + 1) to (youngest) and group the writes by
the predecessor's revision.  (Or perhaps by the predecessor's revision shard, if
you're to write a pack file directly?)  Okay.  The tradeoff is the next
line --- that cached-rev can only be updated after all those revisions'
new noderevs have been added to the cache, without checkpoints in the
middle of that range.

And the 'upper bound' addresses the case when (youngest - cached_rev) is
large.  Okay.

(It does mean we have two codepaths though, eg both packed and
non-packed shards; and <cue wild idea> perhaps it'll be easier to have
*only* packed shards, and to use the expensive replay method to answer
queries about revisions younger than the latest pack file?  (We may want
to use a smaller shard size in this case.)

And, by the way, this idea isn't /so/ unwieldy.  I think that you can
use *only* pack files --- and also use pack files for unfinished shards
--- if, when you run out of INDEX_SIZE, you rewrite the whole file and
move-into-place a new pack file (with a doubled INDEX_SIZE) on top of
the old pack file.  The advantage here would be not having the
both-packed-and-nonpacked-shards headache to deal with.)

>       write $rev into 'cached-rev' file
>       flush 'cached-rev' file

No, you don't flush, you move-into-place.

>       new_successors = {}
>     }
>   }
> 
>   if new_successors != {} {
>     for node_rev in new_successors.keys() {
>       # update_cache_file() implements the 'writer' procedure explained above
>       update_cache_file($node_rev, new_successors[$node_rev])
>     }
>     write $new_cached_rev into 'cached-rev' file
>   }
> 
>   close 'cached-rev' file
> }
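The "move-into-place" update of 'cached-rev' suggested above could look like
this (assuming POSIX rename semantics, which FSFS already relies on for
'current'; real code would use FSFS's own temp-file helpers):

```python
import os
import tempfile

def set_cached_rev(cache_dir, revnum):
    # Update 'cached-rev' atomically: write a temp file in the same
    # directory, fsync it so the contents are durable, then rename it
    # over the old file (rename is atomic on POSIX filesystems).
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    try:
        os.write(fd, b"%d\n" % revnum)
        os.fsync(fd)
    finally:
        os.close(fd)
    os.rename(tmp_path, os.path.join(cache_dir, "cached-rev"))
```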
> 
> 
> If this code crashes, the following can happen:
>   - The cache contains successors computed for revisions beyond the
>     revision listed in the 'cached-rev' file.
>     => The next writer will not add these successors again. No harm done.

Or perhaps one of the successors was partially added, in which case we
fall back to the "the 'free space' offset hadn't been updated" point
from above.  (and the bytes would be overwritten by identical bytes,
most likely)

I think this is fine, but I'd like to read it again to be surer.

> 
> 
> - Packing -
> 
> The cache can optionally be packed. This means that successors of
> node_revs from more than one revision live in the same cache file.
> 

Okay.  I missed your terminology at first, but I see now that you mean
'sharding' and 'packing' in exactly the same sense they are used today
for revision files.

> The on-disk format and cache synchronisation procedure does not change.
> The only thing that changes is the procedure to look up the name of the
> cache file responsible for a given node-rev. The cache file is named
> after the node-rev's revision modulo some fixed and known value.
> 

This is different from how revision shard files are named: $FS/revs/8/8810

> To create the index of a packed cache file the combined number of
> node-revs in all revisions that share the same cache file must be known.
> Thus, a packed cache file can only be created after all revisions
> which share the cache file have been committed.
> 
> Packing can be performed with 'svnadmin pack'.
> 

Okay, so the cache reading procedure MUST change, to account for the
non-packed cache file vanishing (by a concurrent packer) and the pack
file appearing instead.  This is another layer of headache on top of the
base one :-(

[ but see above for an idea that avoids this headache by using pack
files for unfinished shards too ]

> 
> - How does this design live up to our requirements? -
> 
>  - FSFS-compatible:
>    - cache must use some plain file format
>        => yes
>    - writers do not block readers
>        => yes, a reader will get consistent data even while the cache
>           is being written to
>    - cache must not rely on modifying existing revision files
>        => yes
>  - cache can be regenerated on demand
>      => yes

   - cache readers know how complete the data they read is
      => yes, via the returned value of cached-rev

>  - do not corrupt the cache in case of an unexpected crash
>      => the cache can only contain more information than claimed by
>         the 'cached-rev' file
>  - try to keep number of directory entries low
>      => yes, via sharding
>  - conserve inodes by design (preferred), or via packing support
>      => yes, via optional packing
> 
>  - the repository can be live during cache generation
>      => yes, but readers will not see all successors until the cache
>         is fully populated

   - cache can be constructed incrementally
     (ie, amortize the construction among many readers/writers, rather
     than require a full history crawl to be done all at once)

Well, it /can/ be constructed with the smallest increment being "one
revision of successors", and that's what the post-commit FS processing
will do.  The same procedure when run manually is optimized for large
revision ranges; that is a useful property for the initial construction
of a cache.

Re: FSFS successor ID design draft

Posted by Daniel Shahaf <da...@elego.de>.
Stefan Sperling wrote on Fri, Aug 26, 2011 at 13:14:32 +0200:
> Below is an initial draft of a design for successor-IDs in FSFS.
> It is a bit long, but I hope it covers all relevant points in
> sufficient detail.

By the way: yours was a 16KB email and my reply was 25KB.  Can we try
and keep the reply mails short and mutually independent?  For me, at
least, it's easier to handle twenty-five 1KB emails than one 25KB email.
Thanks.

Re: FSFS successor ID design draft

Posted by Julian Foad <ju...@wandisco.com>.
On Tue, 2011-08-30, C. Michael Pilato wrote:
> On 08/30/2011 12:34 PM, Hyrum K Wright wrote:
> >>  There is at most one successor on the same line of history (same copy_id).
> >>  Each copy operation also creates one new successor.
> > 
> > I think we need to be a bit more clear about when a successor is
> > actually created in the case of copy.  For most copies, particularly
> > those of the recursive directory kind, a new successor isn't created
> > until we write to the node.  This has some interesting implications.
> > 
> > For instance, folks usually don't modify the contents of a tag, so
> > there wouldn't be any successor link created for the tag contents.
> > Even though the tag contents are "logical" successor of their source
> > files, they aren't actually stored as such.  Doing so would make
> > copying a O(N) operation instead of O(1).  (Of course, the O(1)
> > behavior gives incomplete results when asking "where did this bug move
> > to?")

So, IIUC, to answer the "Where did this bug move to?" question
correctly, we need to examine the successors of each parent directory of
the file in question as well as of the file itself.  That sounds totally
fine.

- Julian


> > Mike may have already worked most of this out on the BDB branch.
> 
> Sure, the BDB code adds successors only where it also would add predecessors.
>  If you create a tag, you create a copy of some directory.  That copy has a
> predecessor "link" back to the directory it was copied from; the directory
> it was copied from has a successor "link" to the new tag.



Re: FSFS successor ID design draft

Posted by Stefan Sperling <st...@elego.de>.
On Tue, Aug 30, 2011 at 03:25:18PM -0400, C. Michael Pilato wrote:
> > In reading through this, as well as the discussion in IRC, I'm once
> > again wondering why we're bolting this stuff onto the outside of FSFS
> > rather than rethinking the entire FS problem (along with things like
> > obliterate and move-to storage and ...).
> 
> You and me both, brother.

I wanted to address this point briefly, not silently ignore it.

Yes, a new FS design could address the successors problem, and many
other problems, too.

I am not in the position to drive design discussion for an entirely new
FS design. If someone wants to do that, fine. I'd sure like to help out.

I want to try to get something working for FSFS because this is
much more likely to get done in time for 1.8. If this turns out to
be impossible, fair enough -- that means we'll have to tackle a new FS
design to fix this problem. But I am not yet convinced that it is impossible.

Re: FSFS successor ID design draft

Posted by "C. Michael Pilato" <cm...@collab.net>.
On 08/30/2011 12:34 PM, Hyrum K Wright wrote:

>>  There is at most one successor on the same line of history (same copy_id).
>>  Each copy operation also creates one new successor.
> 
> I think we need to be a bit more clear about when a successor is
> actually created in the case of copy.  For most copies, particularly
> those of the recursive directory kind, a new successor isn't created
> until we write to the node.  This has some interesting implications.
> 
> For instance, folks usually don't modify the contents of a tag, so
> there wouldn't be any successor link created for the tag contents.
> Even though the tag contents are "logical" successor of their source
> files, they aren't actually stored as such.  Doing so would make
> copying a O(N) operation instead of O(1).  (Of course, the O(1)
> behavior gives incomplete results when asking "where did this bug move
> to?")
> 
> Mike may have already worked most of this out on the BDB branch.

Sure, the BDB code adds successors only where it also would add predecessors.
 If you create a tag, you create a copy of some directory.  That copy has a
predecessor "link" back to the directory it was copied from; the directory
it was copied from has a successor "link" to the new tag.

> In reading through this, as well as the discussion in IRC, I'm once
> again wondering why we're bolting this stuff onto the outside of FSFS
> rather than rethinking the entire FS problem (along with things like
> obliterate and move-to storage and ...).

You and me both, brother.

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand


Re: FSFS successor ID design draft

Posted by Daniel Shahaf <da...@elego.de>.
Neels J Hofmeyr wrote on Wed, Aug 31, 2011 at 03:17:26 +0200:
> On 08/30/2011 06:34 PM, Hyrum K Wright wrote:
> > In reading through this, as well as the discussion in IRC, I'm once
> > again wondering why we're bolting this stuff onto the outside of FSFS
> > rather than rethinking the entire FS problem (along with things like
> > obliterate and move-to storage and ...).  I realize we can't do
> > *everything*, but this feels strangely like libsvn_wc from 5 or 6
> > years ago.  Is there a compelling reason to reinvent the database?
> 
> I know nothing of other extensions to the fsfs database, but this is how my
> thought process early-outed from suggesting an all-new fsfs db because of
> successors:
> 
> A large part of FSFS should be read-only, grow-at-the-end-only, so that we
> don't need to lock out readers while writing. However, successors are little
> bits of info *later* added to random spots of the big read-only part, like
> dusty strings continuously growing off of the impenetrable wall of
> revisions, as new solid revision bricks sprinkle successor ids from the top.
> No matter how we wriggle it, the successors will probably always be stored
> separately, of sorts bolted on to the RO part...
> 

I don't follow.  Storing a noderev's successors in a file named after
the noderev's id would be append-only; the problem here is that we'll
have tons of writeable files, not that the writeable files won't be
append-only.

> A new format doesn't solve the underlying separation -- unless it's niftier
> than me.
> 
> ~Neels
> 



Re: FSFS successor ID design draft

Posted by Hyrum K Wright <hy...@wandisco.com>.
On Tue, Aug 30, 2011 at 8:17 PM, Neels J Hofmeyr <ne...@elego.de> wrote:
> On 08/30/2011 06:34 PM, Hyrum K Wright wrote:
>> In reading through this, as well as the discussion in IRC, I'm once
>> again wondering why we're bolting this stuff onto the outside of FSFS
>> rather than rethinking the entire FS problem (along with things like
>> obliterate and move-to storage and ...).  I realize we can't do
>> *everything*, but this feels strangely like libsvn_wc from 5 or 6
>> years ago.  Is there a compelling reason to reinvent the database?
>
> I know nothing of other extensions to the fsfs database, but this is how my
> thought process early-outed from suggesting an all-new fsfs db because of
> successors:
>
> A large part of FSFS should be read-only, grow-at-the-end-only, so that we
> don't need to lock out readers while writing. However, successors are little
> bits of info *later* added to random spots of the big read-only part, like
> dusty strings continuously growing off of the impenetrable wall of
> revisions, as new solid revision bricks sprinkle successor ids from the top.
> No matter how we wriggle it, the successors will probably always be stored
> separately, of sorts bolted on to the RO part...

I'll get pedantic here and point out that while your first statement
may be true for FSFS specifically, it is not true for a more general
solution.  You essentially state that "if we don't want writers to
block readers, FSFS should be read-only, append-only."  But the
property of read-only, append-only is not a strict requirement for
non-blocking writers.  (This requirement may exist for FSFS, but the
world is quite a bit larger than just FSFS. :) )

In other words, there exist systems in which writers don't block
readers that use strategies other than read-only, append-only.
Row-level locking in an RDBMS comes to mind.  While I know this thread
is specifically about FSFS, let's keep our minds open. :)  <insert
plea to not reinvent the database here>

I'll also note that FSFS writers *do* block readers for a short period
(the "serialization window"), so the problem is actually one of
minimizing the length of time the process spends in the window, not
doing away with it entirely.

-Hyrum


-- 

uberSVN: Apache Subversion Made Easy
http://www.uberSVN.com/

Re: FSFS successor ID design draft

Posted by Neels J Hofmeyr <ne...@elego.de>.
On 08/30/2011 06:34 PM, Hyrum K Wright wrote:
> In reading through this, as well as the discussion in IRC, I'm once
> again wondering why we're bolting this stuff onto the outside of FSFS
> rather than rethinking the entire FS problem (along with things like
> obliterate and move-to storage and ...).  I realize we can't do
> *everything*, but this feels strangely like libsvn_wc from 5 or 6
> years ago.  Is there a compelling reason to reinvent the database?

I know nothing of other extensions to the fsfs database, but this is how my
thought process early-outed from suggesting an all-new fsfs db because of
successors:

A large part of FSFS should be read-only, grow-at-the-end-only, so that we
don't need to lock out readers while writing. However, successors are little
bits of info *later* added to random spots of the big read-only part, like
dusty strings continuously growing off of the impenetrable wall of
revisions, as new solid revision bricks sprinkle successor ids from the top.
No matter how we wriggle it, the successors will probably always be stored
separately, of sorts bolted on to the RO part...

A new format doesn't solve the underlying separation -- unless it's niftier
than me.

~Neels


Re: FSFS successor ID design draft

Posted by Hyrum K Wright <hy...@wandisco.com>.
I hope I'm not late to the party, but here are a couple of drive-by thoughts.

On Fri, Aug 26, 2011 at 6:14 AM, Stefan Sperling <st...@elego.de> wrote:
> Below is an initial draft of a design for successor-IDs in FSFS.
> It is a bit long, but I hope it covers all relevant points in
> sufficient detail.
>
> Please share your thoughts on this. I have most likely missed a few things.
>
> If we like it I will put it into the repository for further editing,
> either in notes/ on trunk or onto the fsfs-successor ID branch.
>
> If this design has some severe problem I don't mind going back to
> the drawing board. This is my first shot, afterall :)
>
> cmpilato, does this significantly differ from what you've done for BDB?
> Will both backends have the same (or similar) behaviour if we use
> this design for FSFS?
>
> Thanks to neels and danielsh for all your input so far.
> You've been tremendously helpful!
>
> I would also be happy to see alternative proposals.
>
>
>
> FSFS successor IDs
> ==================
>
> See http://svn.apache.org/repos/asf/subversion/branches/fs-successor-ids/BRANCH-README
> for what this is all about.
>
> What follows are ideas about implementing successor ID support for FSFS.
>
> Useful background information:
>  Node-revisions: subversion/libsvn_fs_base/notes/fs-history
>                  subversion/libsvn_fs_base/notes/structure
>  fsfs design: subversion/libsvn_fs_fs/structure
>
>
> The "cache" and its properties
> ==============================
>
> Storage of successor IDs is essentially a cache of the result of the
> inversion of the predecessor relation between all node-revisions.
> This inversion is a *very* expensive operation (follow every node
> that ever existed in the repository from its most recent revision
> back to the revision it was created in).
>
> In the following, "cache" refers to the store for successor IDs.
>
> After a general introduction, a proposed cache design is presented.
> Note that the cache implementation will be hidden behind the API of
> libsvn_fs_fs and can be improved or changed in every new minor
> release of Subversion.
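[As an aside, the inversion described above can be made concrete with a
toy sketch. This is plain Python with made-up node-rev IDs, not FSFS
code -- the real cache is a plain-file on-disk structure, and the ID
format here is only illustrative:]

```python
# Each node-revision records at most one predecessor ("pred:" entry in
# the revision file).  The successor cache is the inverse relation: one
# predecessor may have many successors (one on the same line of history
# plus one per copy that was later written to).
predecessors = {
    "0.0.r2/17": "0.0.r1/4",   # modification on the same line of history
    "0.1.r3/9":  "0.0.r1/4",   # written-to copy of the r1 node
    "0.0.r3/22": "0.0.r2/17",
}

# Invert the map: walk every node-rev once, append it to its
# predecessor's successor list.
successors = {}
for node_rev, pred in predecessors.items():
    successors.setdefault(pred, []).append(node_rev)

print(successors["0.0.r1/4"])   # -> ['0.0.r2/17', '0.1.r3/9']
```

[Doing this over a whole repository means visiting every node-rev ever
created, which is why the draft calls the full inversion *very*
expensive and caches the result instead.]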
>
> Required properties of successor ID cache for FSFS:
>  - FSFS-compatible:
>   - cache must use some plain file format
>   - writers do not block readers (i.e. sqlite is out of the question)
>   - cache must not rely on modifying existing revision files
>  - cache can be regenerated on demand
>  - do not corrupt the cache in case of an unexpected crash
>  - try to keep number of directory entries low
>  - conserve inodes by design (preferred), or via packing support
>
> Nice-to-have properties:
>  - the repository can be live during cache generation
>
>
> Storage size requirements
> =========================
>
> Ballpark figures about node-revs from #svn-dev:
>  * cmpilato does some math.  There are over 350,000 node revisions in my
>      ancient copy of the svn.collab.net Subversion repos.
>  <hwright> there are over 20M of them in the ASF repo

Correction: There are roughly 7.2M nodes in the ASF repo (or there
were before the OpenOffice.org import a few days ago).

> How many successors does a given node-rev have?
>
>  There is at most one successor on the same line of history (same copy_id).
>  Each copy operation also creates one new successor.

I think we need to be a bit more clear about when a successor is
actually created in the case of a copy.  For most copies, particularly
those of the recursive directory kind, a new successor isn't created
until we write to the node.  This has some interesting implications.

For instance, folks usually don't modify the contents of a tag, so
there wouldn't be any successor link created for the tag contents.
Even though the tag contents are the "logical" successors of their
source files, they aren't actually stored as such.  Doing so would
make copying an O(N) operation instead of O(1).  (Of course, the O(1)
behavior gives incomplete results when asking "where did this bug move
to?")

Mike may have already worked most of this out on the BDB branch.

> How big will the cache become, in total?
>
>  The successor cache does not store entire node-revs, just node-rev IDs.
>  These are generally no longer than 40 bytes (less than half a line
>  in an 80x25 terminal, as a rough estimate).
>
>  The amount of raw information to store per node-revision is the
>  size of this mapping:
>    node_rev_id => successor_id
>
>  Each entry is at least 80 bytes large, 40 bytes for the node_rev_id
>  and 40 bytes for the successor ID.
>
>  Each new node-revision adds one entry to the map.
>  Since each successor is itself a node-revision, the entire cache
>  has at most as many entries as there are node-revs in the
>  repository.
>
>  More precisely, the number of successors listed in the cache is
>  equal to the number of node revisions in the repository, minus
>  the number of nodes in HEAD (which don't presently have successors)
>  and minus the number of nodes which were deleted in history (they
>  don't have successors either).
>
>  In other words, the number of successors in the cache is equal to
>  the number of pred: entries in revision files.
>
>  Let's try to plug in some numbers from the above svn.collab.net case.
>
>  There are 350000 node revs.
>    maximum size of successor IDs in cache = 40 bytes * 350000
>    maximum size of node-rev IDs in cache = 40 bytes * 350000
>    maximum size of cache = 40 bytes * 350000 * 2
>
>  This amounts to roughly 28MB worst-case successor cache size for
>  the svn.collab.net repository (which itself is about 500MB in size).
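[The worst-case arithmetic above is easy to check, using the draft's
own numbers -- 350,000 node-revs and a 40-byte upper bound per ID:]

```python
node_revs = 350_000   # node-revs in the svn.collab.net repository
id_size = 40          # rough upper bound per node-rev ID, in bytes

# One entry per node-rev: a node-rev ID key plus a successor ID value.
node_rev_ids  = id_size * node_revs
successor_ids = id_size * node_revs
cache_size = node_rev_ids + successor_ids   # 40 * 350000 * 2

print(cache_size)                 # 28000000 bytes
print(cache_size / (1000 ** 2))   # 28.0 MB, against a ~500MB repository
```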
>
>
> Implementation proposal
> =======================
>...

In reading through this, as well as the discussion in IRC, I'm once
again wondering why we're bolting this stuff onto the outside of FSFS
rather than rethinking the entire FS problem (along with things like
obliterate and move-to storage and ...).  I realize we can't do
*everything*, but this feels strangely like libsvn_wc from 5 or 6
years ago.  Is there a compelling reason to reinvent the database?

Anyway, I probably won't be doing the actual work, so you can ignore
my comments if you wish, but I figured I'd be remiss if I didn't throw
them out there. :)

-Hyrum


-- 

uberSVN: Apache Subversion Made Easy
http://www.uberSVN.com/