Posted to commits@subversion.apache.org by Apache subversion Wiki <co...@subversion.apache.org> on 2012/09/19 01:30:26 UTC

[Subversion Wiki] Update of "StarDelta" by StefanFuhrmann

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Subversion Wiki" for change notification.

The "StarDelta" page has been changed by StefanFuhrmann:
http://wiki.apache.org/subversion/StarDelta

Comment:
WIP. first part

New page:
= Star Deltas =

== Introduction ==

FSFS currently uses xdelta to store different versions of the same node efficiently.
Basically, we represent node x_i as

      x_i = x_i-1 o \delta(x_i, x_i-1)
    x_i-1 = x_i-2 o \delta(x_i-1, x_i-2)
         ...
      x_0 = x_0

and store x_0 plus the incremental \delta information. x_i gets reconstructed by
starting with x_0 and iteratively applying all deltas. Assuming that size(x_i) is
roughly proportional to i and that the delta sizes average around some constant
value, this approach has the following properties:

           storage size(N) = size(x_0) + sum_i=1^N(size(\delta(x_i, x_i-1)))
                           = O(N)
    reconstruction time(N) ~ size(N) + sum_i=1^N(size(x_i))
                           = O(N^2)

with N being either the size of the node or the number of revisions to it.
To mitigate the quadratic runtime behavior, we use skip-deltas:

    x_i = x_base(i) o \delta(x_i, x_base(i)) with base(i) being the next "rounder" binary number

Since we skip intermediate representations, we will repeat the respective change
information (approx .5 log N times). Storage size and reconstruction time are now

           storage size(N) = size(x_0) + sum_i=1^N(size(\delta(x_i, x_base(i))))
                           = O(N log N)
    reconstruction time(N) ~ size(N) + sum_log N(size(x_base))
                           = O(N log N)

Please note that the actual implementation uses a hybrid scheme.
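
For illustration, here is a minimal sketch of one possible base(i) rule, assuming
base(i) is obtained by clearing the lowest set bit of i (the next "rounder" binary
number). This is not the actual FSFS code, but it shows why a reconstruction only
walks about log N deltas:

{{{
/* Sketch only: one possible base(i) rule for skip-deltas, assuming
 * base(i) clears the lowest set bit of i.  Not the actual FSFS code. */
#include <stdio.h>

static unsigned int
skip_delta_base(unsigned int i)
{
  return i & (i - 1);   /* clear the lowest set bit; base(0) stays 0 */
}

int
main(void)
{
  /* The delta chain for x_22 is 22 -> 20 -> 16 -> 0,
   * i.e. only O(log N) deltas are needed. */
  unsigned int i = 22;
  while (i > 0)
    {
      printf("need delta(x_%u, x_%u)\n", i, skip_delta_base(i));
      i = skip_delta_base(i);
    }
  printf("start from fulltext x_0\n");
  return 0;
}
}}}

With this rule, reconstructing x_22 starts from the fulltext x_0 and applies the
deltas against x_16 and x_20, i.e. 3 steps instead of 22.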

'''Observation.''' This approach does not cover forked node histories, which are
common with branches and merges. The same change can be merged into many
branches but may not result in the same node content (i.e. rep caching may not
kick in).  However, those changes themselves are very similar and often even
identical.

A potential solution might be to deltify deltas, i.e. to use a second-order
deltification scheme. Once that is done, even higher orders might be used.
It is unclear how this could be implemented in a meaningful way. Also, its
effectiveness depends heavily on the order in which branched deltas get
processed.


== Basic Goals ==

We would like to be able to deltify everything against everything else. \delta* no
longer considers individual pairs of texts but rather deltifies a text against all
previously processed texts.  Expected storage size, runtime (in-memory) size, write
and reconstruction speed should all be O(N).

'''Note.''' Deltification is only a means to storage size reduction. We do not
attach any semantics to the deltification result. It is neither minimal nor
related to "svn delta" and friends.


== Data structure ==

The core data structure consists of two elements: the text buffer and the
instructions array. The text buffer is a single byte array containing the
various string fragments that will be combined into fulltexts according to
the instructions. The latter is a single array of (offset, count) pairs.
The offset may be prefixed (see below) but in its simplest form, an
instruction copies COUNT bytes from the text buffer starting at offset OFFSET.

Since we want to store more than a single fulltext ("string"), another
array will map the string index to the first instruction in the instructions
array:

{{attachment:star-delta-core.png|core data structure}}
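
In C, the core structure and the extraction step could look roughly like this,
filling in the opaque type from the interface sketch above. Field and function
names are illustrative assumptions, not the prototype's actual definitions; in
particular, the per-string instruction count could equally be derived from the
next string's first instruction:

{{{
/* Sketch of the core structure described above; not the prototype's
 * actual code. */
#include <stddef.h>
#include <string.h>

typedef struct instruction_t
{
  size_t offset;     /* start position within the shared text buffer */
  size_t count;      /* number of bytes to copy from that position */
} instruction_t;

typedef struct star_delta_t
{
  char *text_buffer;             /* all string fragments, back to back */
  instruction_t *instructions;   /* one flat array for all strings */
  size_t *first_instruction;     /* string index -> first instruction */
  size_t *instruction_count;     /* string index -> number of instructions
                                  * (assumed here for simplicity) */
} star_delta_t;

/* Reconstruct the string stored under INDEX by executing its instructions
 * in order: each (offset, count) pair copies COUNT bytes from the text
 * buffer.  Returns the number of bytes written to BUFFER. */
size_t
star_delta_extract(const star_delta_t *sd, unsigned int index,
                   char *buffer, size_t buffer_size)
{
  const instruction_t *ins = sd->instructions + sd->first_instruction[index];
  size_t n = sd->instruction_count[index];
  size_t written = 0;
  size_t i;

  for (i = 0; i < n; i++)
    {
      size_t count = ins[i].count;

      if (count > buffer_size - written)   /* truncate if BUFFER is too small */
        count = buffer_size - written;
      memcpy(buffer + written, sd->text_buffer + ins[i].offset, count);
      written += count;
    }

  return written;
}
}}}

Extraction is a single linear pass over the instructions, which is what keeps
reconstruction proportional to the size of the fulltext.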

== Preliminary results ==

Test data set: first revisions of fs_fs.c
6xx files
15x MB total
38x kB max. file size

Data sizes:
780 kB text body
270k instructions (1.x MB) without common sequence optimization
3xx kB on disk (using quick_pack)

Execution times:
.? s creation (xx MB/s)
.? s optimization (xx MB/s)
.? s compression
.? s total write time (xx MB/s data in, xx MB/s out to disk)

.? s load and extract (xx MB/s)
.? s extract (xx MB/s)
.? s total read time (xx MB/s data out, xx MB/s in from disk)

When imported into an empty 1.8 repository:
1.2 MB revs
1.5 s for svnadmin verify

Re: [Subversion Wiki] Update of "StarDelta" by StefanFuhrmann

Posted by Daniel Shahaf <da...@elego.de>.
Stefan Fuhrmann wrote on Wed, Sep 19, 2012 at 12:12:48 +0200:
> On Wed, Sep 19, 2012 at 10:53 AM, Daniel Shahaf <da...@elego.de> wrote:
> 
> > Apache subversion Wiki wrote on Tue, Sep 18, 2012 at 23:30:26 -0000:
> > > Dear Wiki user,
> > >
> > > You have subscribed to a wiki page or wiki category on "Subversion Wiki"
> > for change notification.
> > >
> > > The "StarDelta" page has been changed by StefanFuhrmann:
> > > http://wiki.apache.org/subversion/StarDelta
> > >
> > > Comment:
> > > WIP. first part
> > >
> > > New page:
> > > = Star Deltas =
> > >
> > > == Introduction ==
> > >
> > > FSFS currently uses xdelta to store different version of the same node
> > efficiently.
> > > Basically, we represent node x_i as
> > >
> > >       x_i = x_i-1 o \delta(x_i, x_i-1)
> > >     x_i-1 = x_i-2 o \delta(x_i-1, x_i-2)
> > >          ...
> > >       x_0 = x_0
> > >
> > > and store x_0 plus the incremental \delta information. x_i gets
> > reconstructed by
> > > starting with x_0 and iteratively applying all deltas. Assuming that
> > size(x_i) is
> > > roughly proportional to i and the deltas averaging around some constant
> > value,
> >
> > This assumption means that every commit to a file increases its size by
> > 948 bytes (or some other constant number that depends only on the
> > node-id).
> 
> 
> This is not what it says! Get some coffee ;)
> 
> My assumptions are
> 
> * the average size of changes tends to be constant over time
>   (e.g. commit size and changed files per commit)
> * the average ratio of lines being changed vs. added vs. removed
>   is constant over time
> 
> The first assumption is certainly justified for code that cannot
> mature beyond some reasonable level. As long as you have
> an open set of requirements, the nature of patches does *not*
> change over time (but it may fluctuate between development
> and stabilization phases).
> 
> The second is based on the observation that a patch for a new
> feature is not very different from a complicated patch for some
> hard bug. The "growth" rate per change of a file is more a function
> of its coding style than its age. Once a feature has been
> implemented (larger changes, many insertions) and stabilized
> (smaller changes, mainly modifications), either the next feature
> or hard bug will start that cycle again.
> 

I'm wondering about the second assumption.  It was my gut feeling that
some patches will add a large number of lines (say, a new function) and
other patches will add N lines and remove N-n lines (n ≪ N).

> > I don't think that's how software development (one use-case
> > of svn) works.  Do you have real world data to corroborate your
> > assumption?  Or perhaps a use case that would trigger such behaviour?
> >
> 
> Well, scroll down to the bottom of the wiki page. My hypothesis
> is definitely a good fit for fs_fs.c: 644 revs to get to 379k with
> an average size of 220k.

So?  Every sequence has an average.  If you want to corroborate your
assumption I'd want to be convinced that the values in the sequence are
not far from that average (220k bytes, in your case).

> Also, the main claim of O(N log N)
> space and time can be witnessed in the apache repo.

Fair enough; while that's not conclusive evidence that your assumptions
are correct, it certainly is supporting evidence.

> My point is not to give an exact formula here as this would
> require tons of pointless formalisms. The key is to make
> the general O(N log N) behavior plausible to the reader - even
> if long term effects would suggest O(N^.8) instead of O(N)
> in my initial assumptions. And that N may be revisions just
> as well as file size.
> 

Sure, I'm not going to ask you for the exponents to three decimal
places, I only wanted to disagree with your assumptions (that preceded
any computation you did) --- for all I could tell, they were made (and
they don't sound unreasonable), but no attempt was made to check their
validity before further math and code were done on their basis.

> The thing is that my prototype \Delta^\ast code has almost
> ideal time and space properties: O(N) and very close to the
> theoretical minimum.
> 

Great! :)   If it behaves well on common use-cases and is
correct/maintainable/etc, let's get it in already :) 

Cheers

Daniel

> -- Stefan^2.

Re: [Subversion Wiki] Update of "StarDelta" by StefanFuhrmann

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Wed, Sep 19, 2012 at 10:53 AM, Daniel Shahaf <da...@elego.de> wrote:

> Apache subversion Wiki wrote on Tue, Sep 18, 2012 at 23:30:26 -0000:
> > Dear Wiki user,
> >
> > You have subscribed to a wiki page or wiki category on "Subversion Wiki"
> for change notification.
> >
> > The "StarDelta" page has been changed by StefanFuhrmann:
> > http://wiki.apache.org/subversion/StarDelta
> >
> > Comment:
> > WIP. first part
> >
> > New page:
> > = Star Deltas =
> >
> > == Introduction ==
> >
> > FSFS currently uses xdelta to store different version of the same node
> efficiently.
> > Basically, we represent node x_i as
> >
> >       x_i = x_i-1 o \delta(x_i, x_i-1)
> >     x_i-1 = x_i-2 o \delta(x_i-1, x_i-2)
> >          ...
> >       x_0 = x_0
> >
> > and store x_0 plus the incremental \delta information. x_i gets
> reconstructed by
> > starting with x_0 and iteratively applying all deltas. Assuming that
> size(x_i) is
> > roughly proportional to i and the deltas averaging around some constant
> value,
>
> This assumption means that every commit to a file increases its size by
> 948 bytes (or some other constant number that depends only on the
> node-id).


This is not what it says! Get some coffee ;)

My assumptions are

* the average size of changes tends to be constant over time
  (e.g. commit size and changed files per commit)
* the average ratio of lines being changed vs. added vs. removed
  is constant over time

The first assumption is certainly justified for code that cannot
mature beyond some reasonable level. As long as you have
an open set of requirements, the nature of patches does *not*
change over time (but it may fluctuate between development
and stabilization phases).

The second is based on the observation that a patch for a new
feature is not very different from a complicated patch for some
hard bug. The "growth" rate per change of a file is more a function
of its coding style than its age. Once a feature has been
implemented (larger changes, many insertions) and stabilized
(smaller changes, mainly modifications), either the next feature
or hard bug will start that cycle again.

> I don't think that's how software development (one use-case
> of svn) works.  Do you have real world data to corroborate your
> assumption?  Or perhaps a use case that would trigger such behaviour?
>

Well, scroll down to the bottom of the wiki page. My hypothesis
is definitely a good fit for fs_fs.c: 644 revs to get to 379k with
an average size of 220k. Also, the main claim of O(N log N)
space and time can be witnessed in the apache repo.

My point is not to give an exact formula here as this would
require tons of pointless formalisms. The key is to make
the general O(N log N) behavior plausible to the reader - even
if long term effects would suggest O(N^.8) instead of O(N)
in my initial assumptions. And that N may be revisions just
as well as file size.

The thing is that my prototype \Delta^\ast code has almost
ideal time and space properties: O(N) and very close to the
theoretical minimum.

-- Stefan^2.


Re: [Subversion Wiki] Update of "StarDelta" by StefanFuhrmann

Posted by Daniel Shahaf <da...@elego.de>.
Apache subversion Wiki wrote on Tue, Sep 18, 2012 at 23:30:26 -0000:
> Dear Wiki user,
> 
> You have subscribed to a wiki page or wiki category on "Subversion Wiki" for change notification.
> 
> The "StarDelta" page has been changed by StefanFuhrmann:
> http://wiki.apache.org/subversion/StarDelta
> 
> Comment:
> WIP. first part
> 
> New page:
> = Star Deltas =
> 
> == Introduction ==
> 
> FSFS currently uses xdelta to store different version of the same node efficiently.
> Basically, we represent node x_i as
> 
>       x_i = x_i-1 o \delta(x_i, x_i-1)
>     x_i-1 = x_i-2 o \delta(x_i-1, x_i-2)
>          ...
>       x_0 = x_0
> 
> and store x_0 plus the incremental \delta information. x_i gets reconstructed by
> starting with x_0 and iteratively applying all deltas. Assuming that size(x_i) is
> roughly proportional to i and the deltas averaging around some constant value,

This assumption means that every commit to a file increases its size by
948 bytes (or some other constant number that depends only on the
node-id).  I don't think that's how software development (one use-case
of svn) works.  Do you have real world data to corroborate your
assumption?  Or perhaps a use case that would trigger such behaviour?