Posted to dev@subversion.apache.org by Greg Hudson <gh...@MIT.EDU> on 2003/07/05 08:40:19 UTC

A forward-thinking thought about deltas in FS design

Tonight I had a fascinating thought.  We've always taken it as
axiomatic that the most recent rev of a file should be stored in plain
text, and earlier revs should be stored as deltas against it, so that
checkouts of the most recent rev are fast.  That's what CVS does (on
the mainline, anyway), and it was certainly important back when we had
no delta combiner and no skip-deltas.

But maybe that's not necessary, with those advanced techniques.  Maybe
we should just store new revs as deltas against older revs, using the
skip-delta technique to ensure that you only need to go through
O(lg(n)) deltas to get to any revision.  (The simplest application of
the technique is: rev 0 is plain-text or self-compressed; for any
other rev, flip the least significant 1 bit in the revnum and store as
a delta against the resulting rev.)

Pros:

  * Beautiful, sweet simplicity.  All data entered into the database
    stays there for all time (modulo "svn obliterate"), and is never
    replaced.  There is no post-commit step where you go back and
    redeltify stuff.  This simplicity is particularly interesting if
    you're trying to write an FS back end which doesn't use a
    database--for instance (not necessarily the best idea), you could
    imagine implementing the FS as a single file which is only
    appended to.

  * Commits become faster, though our commits are already pretty fast
    in theory.  Commit performance is also more predictable for large
    files.  (You don't have this redeltification step which might be
    slow because some earlier rev of the file was big.)

Cons:
 
  * Checkouts become slower--not by a huge factor, but maybe by a
    noticeable constant factor.  For most repositories, checkouts are
    much more common than checkins, so this is a big deal.  (However,
    it's easier to make a simpler system perform well, so the tradeoff
    may not be so obvious.)

  * Balancing space and time is harder for the all-important first few
    revs.  But I think it's not too bad for normal workloads.

I'm not seriously suggesting that anyone run out and try implementing
this (although it would probably be quite easy, and performance
numbers would be intriguing), especially since it is not much more
painful to try after 1.0.  But it's an interesting idea and I wanted
to get it into the list archives for future reference.  (In
particular, I think the benefits may be highly compelling for
libsvn_fs_fs, if that ever happens.)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: A forward-thinking thought about deltas in FS design

Posted by Stefan Monnier <mo...@cs-www.cs.yale.edu>.
> You can really only make the strong guarantees if you always store a
> skip-delta immediately upon commit and never throw them away.

I'd expect that the currently committed revision (i.e. the latest) would
simply always be immediately added to the cache.  I don't understand your
comment about skip-deltas since AFAIU there wouldn't be any such thing
on-disk any more (you'd only have one-step-deltas and a cache composed of
actual files).

> That's a very degenerate case of "caching."

It's very common to do such opportunistic caching.

> For a large project, this means for that period of time (which may well
> be days, not hours) the system doesn't work.  If it takes two days
> instead of half an hour to check out the latest version of Mozilla, the
> system is not carrying on gracefully.

I must admit that I have no idea what impact the skip-deltas have
in reality.  But my experience with CVS is that retrieving revision
1.1 from the repository (which is O(n)) is not 96 times slower,
as your example assumes it would be.  Maybe Mozilla's repository has an `n'
much larger than Emacs's.


        Stefan



Re: A forward-thinking thought about deltas in FS design

Posted by Greg Hudson <gh...@MIT.EDU>.
On Mon, 2003-07-07 at 02:21, Tom Lord wrote:
>     > You don't typically get performance "guarantees" with a
>     > cache because you don't know if it has been populated with the fast path
>     > you want since the last time it has been thrown away.
> 
> That's false.
> 
> You have to consider (a) your algorithm that decides what to cache and
> what to throw away;  (b) the size of your cache.
> 
> Under (often quite reasonable) interactions of (a) and (b), you can
> make strong guarantees.

You can really only make the strong guarantees if you always store a
skip-delta immediately upon commit and never throw them away.  That's a
very degenerate case of "caching."  (I grant that if you throw away a
minuscule fraction of the skip-delta information you will, in some
cases, only get a minor degradation in read performance.  But because
you'd get an even less significant improvement in throughput or space,
there is no advantage in thinking of the skip-delta store as a cache.)

> Given a cache such as I described, the hard guarantees you're worried
> about obtain.  You can back it up if you like.  You can even send in a
> junior-clueless-admin to run "rm -rf" on a portion of the cache -- and
> then what happens?  The performance guarantees are lost for a few
> hours.  The system gracefully carries on anyway.  Then everything is
> back to normal.

For a large project, this means for that period of time (which may well
be days, not hours) the system doesn't work.  If it takes two days
instead of half an hour to check out the latest version of Mozilla, the
system is not carrying on gracefully.

>     > > The caching/memoizing approach would make it
>     > > easier, I think, to be computing skip-deltas concurrently and
>     > > opportunistically.
> 
>     > Why would you want to?  
> 
> For better actual, real-world performance.   

You're skipping too many steps here.  How does it result in better
real-world performance to compute a single delta and then (concurrently
or "opportunistically," whatever that means) compute the skip delta you
need for fast access?

>     > It doesn't take more time (on average) to
>     > compute the skip-delta than it does to compute the single delta.
> 
> That isn't the point.   It's not relevant to anything I said.

When you skip too many steps, I have to guess at your intermediate
thought processes.  When I guess wrong, I may say something irrelevant
to what you were thinking, though it's certainly relevant to something
you could have been thinking given what you wrote.  In this case, I was
theorizing that you thought one could get better throughput in the
primary store by taking less time to compute a single delta, an argument
which doesn't hold.  I still don't know what your actual argument is on
this point.

(Perhaps you're saying you'd get faster throughput in the primary store
because you have to write less data?  That's too simplistic; you don't
get faster compression throughput by running bzip2 than you do with
gzip, even though you're writing less data.)

>     > > Read-only mirrors and clients could be building their own caches.

>     > Again, no advantage versus my approach; read-only mirrors and clients
>     > could be building their own caches there too.

What I should have said here is: in my approach, a read-only mirror of
the primary repository gets fast access just like the primary does. 
Being able to build your own cache of a slow primary isn't any better.

> In general, you seem to be armed with a lot more basic theory than
> non-basic theory or practice.

I'm not really interested in having an ad hominem argument with you.

>     > > Given the kinds of access needed to "row 0",
>     > > perhaps it could be stored more compactly and updated more quickly.
> 
>     > I don't see this translating into a practical advantage.
> 
> Um, much faster and more space-efficient write-txns?

I don't think the write transactions become faster.  In my approach, you
pick an appropriate rev to delta against, produce it (which is fast,
because producing a given rev is always fast), compute the delta, and
store that.  In your approach, you pick the previous rev, produce it,
compute the delta, and store.  Computing a delta between two dissimilar
files doesn't take more time than computing a delta between two similar
files, so we've taken the same length of time.  (Perhaps you're
implicitly arguing that the client can do the work of computing the
single delta.  Subversion isn't currently structured to allow that, and
the server probably couldn't ensure database integrity if it were, so
I'll assume that you aren't making that argument.)  Then you also
compute a skip-delta and store it (you can't allow this process to fall
behind if you want guaranteed acceptable performance) which takes
additional time.

The write transactions to the primary store do take less space in your
scheme, by an average factor of lg(N) in the worst comparative case (I
am curious: in real projects, how much larger is the average delta
across M revs of a file than the average delta across one rev, as a
function of M?).  But then you also have to go and add data to the
cache.  So where is the practical advantage?  If you want the
performance guarantees of the skip-deltas, you have to accept the cost;
splitting them apart does not make the total cost any less.

>     > > A cache could easily revert to full-texts when skip-deltas become
>     > > irrational.
> 
>     > Why would they be any more likely to become irrational than single
>     > deltas would be?
> 
> Hello?!?!   Once again, large skip-deltas on busy files are going to
> converge on full-text sizes.

I think I understand what you're arguing.  But if a skip-delta turns out
to be larger than the plain text (should never happen except by a tiny
margin, in the degenerate case where the target data is uncompressible
and gets no help at all from the source), it's just as easy to
substitute plain text in my scheme as it is in yours.

>     > You've proposed a tiny little simplification of the primary storage
>     > manager at the expense of decomposing the storage system into two parts,
>     > radically increasing the overall complexity of the system.  Not a win.

> ?

I don't see what's so hard to understand about this.  A system composed
of a primary store and a cache is more complicated than a system
composed of only a primary store, unless both parts of the decomposed
system can be radically simpler than the unified primary store.  Since
you have proposed only a tiny simplification to the primary store, your
overall system would be more complicated.



Re: A forward-thinking thought about deltas in FS design

Posted by Tom Lord <lo...@emf.net>.

    > From: kfogel@collab.net

    > > You're oversimplifying quite a bit.   I'm not even sure where to begin
    > > debugging you.  I'll try anyway.

    > Whoa, Tom.  Full stop.

You talking to your horse?   Should I perceive a personal insult here?


    > (and don't try to claim that "I'm not even sure where to
    > begin debugging you" is not insulting, you'd just cause a lot of
    > laughter around the world).

Wow.


-t



Re: A forward-thinking thought about deltas in FS design

Posted by kf...@collab.net.
Tom Lord <lo...@emf.net> writes:
> You're oversimplifying quite a bit.   I'm not even sure where to begin
> debugging you.  I'll try anyway.

Whoa, Tom.  Full stop.

Meta-discussion is not how I like to spend my time, any more than you
do.  Unfortunately, when you start writing stuff like this, I'm sort
of obliged to open a meta-discussion :-(.

If you think Greg's missed a technical point, say something like "I
think you misunderstood what I was saying..." or "No, that's not what
I meant, let me try another way..."  But personal insults are way out
of bounds (and don't try to claim that "I'm not even sure where to
begin debugging you" is not insulting, you'd just cause a lot of
laughter around the world).

Greg H has enough self-confidence that he won't be discouraged by your
bad manners, I'm sure.  But your tone affects the whole list, not just
him.  Please help keep this mailing list civilized.  Believe me, you
wouldn't be the only person who has to restrain himself from time to
time; see Greg's very even-tempered responses for an example.

Thank you,
-Karl


Re: A forward-thinking thought about deltas in FS design

Posted by Tom Lord <lo...@emf.net>.

    > From: Greg Hudson <gh...@MIT.EDU>

    > > Have you considered changing the top row of your diagram [to]:

    > >   0 <- 1 <- 2 <- 3 <- 4 <- 5 <- 6 <- 7 <- 8 <- 9 <- 10 <- 11 <- 12 ....

    > Yes, but I don't think there's an advantage.

I pointed out several.

    > > So if you cached the other arrows in your diagram, and the cache is
    > > given roughly the same amount of disk space as is needed for all the
    > > arrows in your diagram, then the cache will function as a memo, and
    > > you'll have the same performance guarantees that you have now.

    > So, I'm never quite sure how you use the word "cache".  In my world, a
    > cache is (1) something you populate on demand, and (2) something you can
    > throw away.  

That's correct.

    > You don't typically get performance "guarantees" with a
    > cache because you don't know if it has been populated with the fast path
    > you want since the last time it has been thrown away.

That's false.

You have to consider (a) your algorithm that decides what to cache and
what to throw away;  (b) the size of your cache.

Under (often quite reasonable) interactions of (a) and (b), you can
make strong guarantees.


    > For guaranteed acceptable performance, you cannot treat fast path data
    > as expendable.  It is too important for ensuring reasonable access times
    > and it takes too long to rebuild.  So it has to be updated reliably, it
    > has to be backed up just as well, it has to be replicated just as well.

You're oversimplifying quite a bit.   I'm not even sure where to begin
debugging you.  I'll try anyway.

Given a cache such as I described, the hard guarantees you're worried
about obtain.  You can back it up if you like.  You can even send in a
junior-clueless-admin to run "rm -rf" on a portion of the cache -- and
then what happens?  The performance guarantees are lost for a few
hours.  The system gracefully carries on anyway.  Then everything is
back to normal.



    > So either you're using the word "cache" loosely 

No -- you're using it to mean more than it means.   


    > > The caching/memoizing approach would make it
    > > easier, I think, to be computing skip-deltas concurrently and
    > > opportunistically.

    > Why would you want to?  

For better actual, real-world performance.   


    > It doesn't take more time (on average) to
    > compute the skip-delta than it does to compute the single delta.

That isn't the point.   It's not relevant to anything I said.


    > > Read-only mirrors and clients could be building their own caches.

    > Again, no advantage versus my approach; read-only mirrors and clients
    > could be building their own caches there too.


In general, you seem to be armed with a lot more basic theory than
non-basic theory or practice.


    > > Given the kinds of access needed to "row 0",
    > > perhaps it could be stored more compactly and updated more quickly.

    > I don't see this translating into a practical advantage.

Um, much faster and more space-efficient write-txns?



    > > A cache could easily revert to full-texts when skip-deltas become
    > > irrational.

    > Why would they be any more likely to become irrational than single
    > deltas would be?

Hello?!?!   Once again, large skip-deltas on busy files are going to
converge on full-text sizes.


    > You've proposed a tiny little simplification of the primary storage
    > manager at the expense of decomposing the storage system into two parts,
    > radically increasing the overall complexity of the system.  Not a win.


?

-t



Re: A forward-thinking thought about deltas in FS design

Posted by Greg Hudson <gh...@MIT.EDU>.
On Sun, 2003-07-06 at 12:18, Tom Lord wrote:
> Have you considered changing the top row of your diagram [to]:

>   0 <- 1 <- 2 <- 3 <- 4 <- 5 <- 6 <- 7 <- 8 <- 9 <- 10 <- 11 <- 12 ....

Yes, but I don't think there's an advantage.

> So if you cached the other arrows in your diagram, and the cache is
> given roughly the same amount of disk space as is needed for all the
> arrows in your diagram, then the cache will function as a memo, and
> you'll have the same performance guarantees that you have now.

So, I'm never quite sure how you use the word "cache".  In my world, a
cache is (1) something you populate on demand, and (2) something you can
throw away.  You don't typically get performance "guarantees" with a
cache because you don't know if it has been populated with the fast path
you want since the last time it has been thrown away.

For guaranteed acceptable performance, you cannot treat fast path data
as expendable.  It is too important for ensuring reasonable access times
and it takes too long to rebuild.  So it has to be updated reliably, it
has to be backed up just as well, it has to be replicated just as well.

So either you're using the word "cache" loosely (a better word would be
"augmentation"), or you're fast-talking around the GAP requirement.

> The caching/memoizing approach would make it
> easier, I think, to be computing skip-deltas concurrently and
> opportunistically.

Why would you want to?  It doesn't take more time (on average) to
compute the skip-delta than it does to compute the single delta.

> Read-only mirrors and clients could be building their own caches.

Again, no advantage versus my approach; read-only mirrors and clients
could be building their own caches there too.

> Given the kinds of access needed to "row 0",
> perhaps it could be stored more compactly and updated more quickly.

I don't see this translating into a practical advantage.

> A cache could easily revert to full-texts when skip-deltas become
> irrational.

Why would they be any more likely to become irrational than single
deltas would be?

> In a long-lived, busy archive, I'm not so sure the
> performance guarantee you are after is really worth more than a
> guarantee of _better_ performance where the access patterns are hot --
> and the caching approach can make that trade-off smoothly (if it takes
> me 3x as long to retrieve a 7-year-old kernel that nobody else has
> fetched for years, but 1/2 as long to get last month's ...).

We're not talking about a factor of 3x here; we're talking about O(n)
versus O(lg(n)), which can quickly become a factor of 100 or 1000.

> The simpler row-0 representation

From the point of view of the current Subversion design, the single
delta representation isn't really any simpler.  A rep is either plain
text or a delta against another rep; skip-deltas just change what the
"other rep" is.  All the complexity of skip-deltas (which isn't very
much) is in the algorithm, not on disk.  (Tiny exception: to implement
skip deltas, I had to add a predecessor count to node-revisions.)

> , being the only txnal data needed, can
> help stabilize the implementation of the storage manager and simplify
> its design.  The cache implementation could be made independent of the
> primary storage manager internals -- making it easier to write new
> storage manager implementations (and new caching-engine
> implementations).

You've proposed a tiny little simplification of the primary storage
manager at the expense of decomposing the storage system into two parts,
radically increasing the overall complexity of the system.  Not a win.



Re: A forward-thinking thought about deltas in FS design

Posted by Tom Lord <lo...@emf.net>.

    > From: Greg Hudson <gh...@MIT.EDU>
    > > Once upon a time, back in the mid-80s I worked on a commercial VC tool
    > > called SVM (Seidl Version Manager).  We were one of the few VC tools
    > > for the PC (other main ones being PVCS, TLIB, and a relative newcomer
    > > named MKS ;-). Ours was the only one of that bunch that used
    > > forward-deltas. We overcame the performance issue by keeping a cache
    > > of the most recent N versions in a cache area (where N could be
    > > configured per system, as could the set of files for which to keep the
    > > cache - if I recall correctly anyway :)

    > Caching fails to ensure guaranteed acceptable performance, which I think
    > is an important property of an FS back end.

_That_ kind of caching fails to ensure the performance guarantee you
are worried about, but related techniques do not.

Have you considered changing the top row of your diagram:


  0 <- 1  2 <- 3  4 <- 5  6 <- 7  8 <- 9   10 <- 11  12 <- 13  14 <- 15
  0 <---- 2       4 <---- 6       8 <----- 10        12 <----- 14
  0 <------------ 4               8 <--------------- 12
  0 <---------------------------- 8

so that it is:

  0 <- 1 <- 2 <- 3 <- 4 <- 5 <- 6 <- 7 <- 8 <- 9 <- 10 <- 11 <- 12 ....

The space requirement for that row would be comparable to, but most
likely smaller than the sum of the sizes of the arrows in your
diagram.  (Because a skip-delta will tend to be larger than the first
single-step arrow it summarizes.  Indeed, "deep" skip-deltas (0<--8
being the deepest in your diagram), on busy files, will tend to be as
large as a full-text copy!  A lot of "shallow" skip-deltas will tend
to be twice the size of the next-shallowest (logical) delta for the
same revision.)

So if you cached the other arrows in your diagram, and the cache is
given roughly the same amount of disk space as is needed for all the
arrows in your diagram, then the cache will function as a memo, and
you'll have the same performance guarantees that you have now.
(And admins could make a fine grain space/time trade-off from there.)

So, naively, it sounds like (worst case) O(2x) the total disk space
(compared to your approach), a best case (1x) the space, with
sufficient caching/memoizing to yield the same performance.  But I
wonder if other factors and benefits wouldn't (a) reduce the space in
the direction of the "best case" scenario; (b) actually yield _better_
performance; (c) make the caching approach winning anyway.

For example, as with your approach, this could easily (dramatically)
reduce the growth rate of write-ahead logs (which also contributes to
write-txn throughput).  The caching/memoizing approach would make it
easier, I think, to be computing skip-deltas concurrently and
opportunistically.  Read-only mirrors and clients could be building
their own caches.  Given the kinds of access needed to "row 0",
perhaps it could be stored more compactly and updated more quickly.  A
cache could easily revert to full-texts when skip-deltas become
irrational.  In a long-lived, busy archive, I'm not so sure the
performance guarantee you are after is really worth more than a
guarantee of _better_ performance where the access patterns are hot --
and the caching approach can make that trade-off smoothly (if it takes
me 3x as long to retrieve a 7-year-old kernel that nobody else has
fetched for years, but 1/2 as long to get last month's ...).  The
simpler row-0 representation, being the only txnal data needed, can
help stabilize the implementation of the storage manager and simplify
its design.  The cache implementation could be made independent of the
primary storage manager internals -- making it easier to write new
storage manager implementations (and new caching-engine
implementations).

In short, when you're looking at Big-O space and time complexities, it
comes out the same whether the skip-deltas are part of the primary
data store, or part of external caches and memos.   It's all the
secondary effects of the external approach that can make it the more
winning technique.

-t


Re: A forward-thinking thought about deltas in FS design

Posted by Greg Hudson <gh...@MIT.EDU>.
On Sun, 2003-07-06 at 00:09, Brad Appleton wrote:
> > using the skip-delta technique to ensure that...

> Is this the same thing as the interleaved (or "inline", or sometimes
> called "interlaced") delta approach that is used by ClearCase and SCCS
> (among others)?  It works basically like a great big bunch of
> "#ifdefs" for each delta. If not, can you say more about how it
> differs from the inline delta approach?

It's not a weave, no.  A weave doesn't scale particularly well when
there are many revisions of a file.

I'll reiterate the new idea.  For any n>0, let D(n) be n-(1<<m) where
(1<<m) is the least significant bit set in n.  (D(1) is 0, D(8) is 8,
D(13) is 12, D(20) is 16.)  We store rev 0 of a file as plaintext (or
self-compressed), and rev n of a file as a delta against rev D(n).  The
first 16 revs of a file would look like:

  0 <- 1  2 <- 3  4 <- 5  6 <- 7  8 <- 9   10 <- 11  12 <- 13  14 <- 15
  0 <---- 2       4 <---- 6       8 <----- 10        12 <----- 14
  0 <------------ 4               8 <--------------- 12
  0 <---------------------------- 8

So to fetch the contents of rev 15 of the file, you'd see that rev 15 is
a delta against rev 14, which is a delta against rev 12, which is a
delta against rev 8, which is a delta against rev 0.  You combine those
four deltas using the delta combiner, apply them to rev 0, and get your
result.
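That walk can be sketched as follows (Python for illustration; the toy `deltas` mapping and `base_text` stand in for svndiff deltas and the delta combiner, and are not Subversion APIs):

```python
def fetch(rev, base_text, deltas):
    """Reconstruct rev's contents by walking the skip-delta chain
    back to rev 0, then applying the deltas oldest-first.
    deltas[r] is a toy delta: a function mapping the contents of
    rev D(r) to the contents of rev r."""
    chain = []
    while rev > 0:
        chain.append(rev)
        rev &= rev - 1          # D(rev): clear the least significant 1 bit
    text = base_text            # contents of rev 0
    for r in reversed(chain):   # e.g. for rev 15: apply 8, 12, 14, 15
        text = deltas[r](text)
    return text
```

For rev 15 the chain passes through revs 14, 12, 8, and 0, so exactly four deltas are combined and applied, matching the walk described above.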

(What we do right now is similar to the above, but in reverse, so that
older revs are expressed as deltas against newer revs.  The last section
of http://web.mit.edu/ghudson/thoughts/file-versioning contains a
description of skip-deltas as we use them today.)

> I think the inline approach makes it easier than the reverse-delta (or
> RCS style with reverse-deltas on the main trunk and forward-delta on
> branches) when it comes down to doing merge-algebra in an intelligent
> fashion.

How?

> Once upon a time, back in the mid-80s I worked on a commercial VC tool
> called SVM (Seidl Version Manager).  We were one of the few VC tools
> for the PC (other main ones being PVCS, TLIB, and a relative newcomer
> named MKS ;-). Ours was the only one of that bunch that used
> forward-deltas. We overcame the performance issue by keeping a cache
> of the most recent N versions in a cache area (where N could be
> configured per system, as could the set of files for which to keep the
> cache - if I recall correctly anyway :)

Caching fails to ensure guaranteed acceptable performance, which I think
is an important property of an FS back end.



Re: A forward-thinking thought about deltas in FS design

Posted by Brad Appleton <br...@bradapp.net>.
On Sat, Jul 05, 2003 at 04:40:19AM -0400, Greg Hudson wrote:
> axiomatic that the most recent rev of a file should be stored in plain
> text, and earlier revs should be stored as deltas against it, so that
> 
> But maybe that's not necessary, with those advanced techniques.  Maybe
> we should just store new revs as deltas against older revs, using the
> skip-delta technique to ensure that you only need to go through
> O(lg(n)) deltas to get to any revision.

Is this the same thing as the interleaved (or "inline", or sometimes called "interlaced") delta approach that is used by ClearCase and SCCS (among others)?  It works basically like a great big bunch of "#ifdefs" for each delta. If not, can you say more about how it differs from the inline delta approach?

I think the inline approach makes it easier than the reverse-delta (or RCS style with reverse-deltas on the main trunk and forward-delta on branches) when it comes down to doing merge-algebra in an intelligent fashion.

> Cons:
>  
>   * Checkouts become slower--not by a huge factor, but maybe by a
>     noticeable constant factor.  For most repositories, checkouts are
>     much more common than checkins, so this is a big deal.  (However,
>     it's easier to make a simpler system perform well, so the tradeoff
>     may not be so obvious.)

Once upon a time, back in the mid-80s I worked on a commercial VC tool called SVM (Seidl Version Manager).  We were one of the few VC tools for the PC (other main ones being PVCS, TLIB, and a relative newcomer named MKS ;-).  Ours was the only one of that bunch that used forward-deltas. We overcame the performance issue by keeping a cache of the most recent N versions in a cache area (where N could be configured per system, as could the set of files for which to keep the cache - if I recall correctly anyway :)

>   * Balancing space and time is harder for the all-important first few
>     revs.  But I think it's not too bad for normal workloads.

-- 
Brad Appleton <br...@bradapp.net> www.bradapp.net
  Software CM Patterns (www.scmpatterns.com)
   Effective Teamwork, Practical Integration
"And miles to go before I sleep." -- Robert Frost


Re: A forward-thinking thought about deltas in FS design

Posted by Jack Repenning <jr...@collab.net>.
At 4:40 AM -0400 7/5/03, Greg Hudson wrote:
>We've always taken it as
>axiomatic that the most recent rev of a file should be stored in plain
>text, and earlier revs should be stored as deltas against it, so that
>checkouts of the most recent rev are fast.  That's what CVS does (on
>the mainline, anyway), and it was certainly important back when we had
>no delta combiner and no skip-deltas.

Well, CVS does that because RCS does that.  And RCS does that because 
Tichy published a paper showing that it was faster for retrieving the 
HEAD revision on files of around 200 lines (with a "lemma" survey 
showing that this was the average or typical file size at the time). 
On that basis, he implemented RCS that way, and it got a lot of 
academic macho points and took over.  A few years later, people 
noticed that 200 lines was no longer typical (on my project, the 50th 
percentile is at 29682 lines!), and that "the RCS advantage" was 
illusory at such sizes.  Commercial developers also have much less 
allegiance to HEAD, being into things like supporting older versions, 
patching patches, and working on isolation branches (which really 
kills the RCS assumptions).

So, yeah, I like where you're headed, Greg!
-- 
-==-
Jack Repenning
CollabNet, Inc.
8000 Marina Boulevard, Suite 600
Brisbane, California 94005
o: 650.228.2562
c: 408.835-8090
