You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by Daniel Berlin <db...@dberlin.org> on 2005/02/11 22:31:23 UTC

[PATCH]: Implement xdelta, use it by default

It turns out i forgot to extend matches backwards as far as possible as
well. Doing that buys us another 5-10% of space back, so the space used
for xdelta is only about 10-15% more.

This implementation is, on average, 5x faster than vdelta, compiled at
-O2.  In my testing, it is never slower (and never even close to as
slow :P)

The current implementation falls back to using vdelta when the source
data is null, in order to get self-compression like we currently do.


This patch passes all regression tests, and has been tested by loading
the entire gcc repo, redumping, and verifying everything still matches
in the dumpfiles :).

Add xdelta algorithm, and use it by default

* subversion/libsvn_delta/vdelta.c: Add apr_hash.h include.
(struct adler32): New structure.
(adler32_in): New function.
(adler32_out): Ditto.
(adler32_sum): Ditto.
(init_adler32): Ditto.
(struct match): New structure.
(init_matches_table): New function.
(find_match): Ditto.
(compute_delta): Ditto.
(NOOP, XDELTA, VDELTA): Macros to determine which delta algorithm to
use.
Default to xdelta.
(svn_txdelta__vdelta): Call the approriate delta generator.


I'm also happy to create a new xdelta.c file and use it in text_delta
instead if that is what is preferred instead of doing it this way.
:)


Re: [PATCH]: Implement xdelta, use it by default

Posted by Daniel Berlin <db...@dberlin.org>.
On Sat, 2005-02-12 at 00:22 +0000, Philip Martin wrote:
> Daniel Berlin <db...@dberlin.org> writes:
> 
> Just a few quick comments on the implementation:
> 
> > +/* This is pseudo-adler32, and xdelta, as monotone does it. The
> > +   adler32 has no prime modulus.  This is a direct translation of the
> > +   C++ code, hence the reason it looks a bit weird. */
> 
> monotone is GPL isn't it, can a "direct translation" be released under
> the Subversion licence?  I think Adler32 is widely used under a
> variety of licences, not so sure about xdelta.

Yes it is gpl, and xdelta is used under a wide variety of licenses.
Also, yes, you can release a direct translation under the subversion
license.

It's also not copyrightable in it's current state anyway, for various
reasons (literally just computes a mathemtical formula in the most
direct way possible).
His xdelta is just a translation of the algorithm in the xdelta paper
anyway.

It was translated from monotone, but that's because i didn't have
ghostview handy when i wrote it.

The paper describes it in basically the exact same terms (in terms of an
init_matches function, a compute_delta function, and a find_match
function that do exactly what we do).
In other words, this is a complete non-issue.

Regardless, for the purposes of removing all doubt, I have acquired
Graydon Hoare's (the original author of the untranslated C++ code in
question) specific permission to use and distribute the translated code
under the subversion license terms, and have noted the date and time he
gave it (this is sufficient for legal purposes, he'd be estopped from
claiming otherwise unless he can prove it didn't happen).  I'm sure
graydon would be happy to confirm this or whatever.



> We sort of target C89 and don't use inline, how about using APR_INLINE?

Already fixed in the attached new version.

I noticed that.

> > +  /* Extend the match forward as far as possible */
> > +  while ((apos + alen >= 0)
> > +	 && (bpos + badvance >= 0)
> 
> ../svn/subversion/libsvn_delta/vdelta.c: In function 'find_match':
> ../svn/subversion/libsvn_delta/vdelta.c:452: warning: comparison of unsigned expression >= 0 is always true
> ../svn/subversion/libsvn_delta/vdelta.c:453: warning: comparison of unsigned expression >= 0 is always true
Yes, fixed by removing the useless tests.

> > +
> > +/* Size of the blocks we compute checksums for.  */
> > +#define MATCH_BLOCKSIZE 64 
> 
> Add some rationale for 64 please, even "64 is a guess" or an explicit
> "64 is what monotone uses" would do.
Will do.

I've attached a new patch
I've also moved the routine into it's own file, xdelta.c, and made the
approriate changes to call the right routine from compute_window.

New log:

Add xdelta algorithm, and use it by default

* subversion/libsvn_delta/xdelta.c: New file, xdelta algorithm.
* subversion/libsvn_delta/delta.h: Add svn_txdelta__xdelta prototype.
* subversion/libsvn_delta/text_delta.c (compute_window): Default to
xdelta.


Re: [PATCH]: Implement xdelta, use it by default

Posted by Philip Martin <ph...@codematters.co.uk>.
Daniel Berlin <db...@dberlin.org> writes:

Just a few quick comments on the implementation:

> +/* This is pseudo-adler32, and xdelta, as monotone does it. The
> +   adler32 has no prime modulus.  This is a direct translation of the
> +   C++ code, hence the reason it looks a bit weird. */

monotone is GPL isn't it, can a "direct translation" be released under
the Subversion licence?  I think Adler32 is widely used under a
variety of licences, not so sure about xdelta.

> +#define adler32_mask 0x0000ffff
> +#define adler32_char_mask 0x000000ff
> +
> +/* Structure to store the state of our adler32 checksum.  */
> +struct adler32
> +{
> +  apr_uint32_t s1;
> +  apr_uint32_t s2;
> +  apr_uint32_t len;
> +  apr_uint32_t mask;
> +};
> +
> +/* Feed C into the adler32 checksum.  */
> +
> +static inline void

We sort of target C89 and don't use inline, how about using APR_INLINE?

> +adler32_in (struct adler32 *ad, const char c)
> +{
> +  ad->s1 += ((apr_uint32_t) (c)) & adler32_char_mask;
> +  ad->s1 &= adler32_mask;
> +  ad->s2 += ad->s1;
> +  ad->s2 &= adler32_mask;
> +  ad->len++;
> +}
> +

> +static void
> +find_match (apr_hash_t *matches,
> +	    const struct adler32 *rolling,
> +	    const char *a,
> +	    apr_uint32_t asize,
> +	    const char *b,
> +	    apr_uint32_t bsize,
> +	    apr_uint32_t bpos,
> +	    apr_uint32_t *aposp,
> +	    apr_uint32_t *alenp,
> +	    apr_uint32_t *badvancep,
> +	    svn_stringbuf_t **pending_insert)
> +{
> +  apr_uint32_t sum = adler32_sum (rolling);
> +  apr_uint32_t alen, badvance, apos;
> +  struct match *match;
> +  apr_uint32_t tpos, tlen;
> +
> +  /* See if we have a match.  */
> +  match = apr_hash_get (matches, &sum, sizeof (sum));  
> +  if (match == NULL)
> +    return;
> +  
> +  /* See where our match started.  */
> +  tpos = match->pos;
> +  tlen = match->len;
> +  
> +  /* Make sure it's not a false match.  */
> +  if (memcmp (a + tpos, b + bpos, tlen) != 0)
> +    return;
> +
> +  apos = tpos;
> +  alen = tlen;
> +  badvance = tlen;
> +
> +  /* Extend the match forward as far as possible */
> +  while ((apos + alen >= 0)
> +	 && (bpos + badvance >= 0)

../svn/subversion/libsvn_delta/vdelta.c: In function 'find_match':
../svn/subversion/libsvn_delta/vdelta.c:452: warning: comparison of unsigned expression >= 0 is always true
../svn/subversion/libsvn_delta/vdelta.c:453: warning: comparison of unsigned expression >= 0 is always true

> +	 && (apos + alen < asize)
> +	 && (bpos + badvance < bsize)
> +	 && (a[apos + alen] == b[bpos + badvance]))
> +    {
> +      ++alen;
> +      ++badvance;
> +    }
> +
> +  /* See if we can extend backwards into a previous insert hunk.  */
> +  if (*pending_insert)
> +    {
> +      while (apos > 0
> +	     && bpos > 0
> +	     && a[apos - 1] == b[bpos - 1]
> +	     && (*pending_insert)->len != 0)
> +	{
> +	  svn_stringbuf_chop (*pending_insert, 1);
> +	  --apos;
> +	  --bpos;
> +	  ++alen;
> +	}
> +      /* If we completely consumed the entire insert, delete it.  */
> +      if ((*pending_insert)->len == 0)
> +	*pending_insert = NULL;
> +    }
> +
> +  *aposp = apos;
> +  *alenp = alen;
> +  *badvancep = badvance;
> +} 
> +
> +/* Size of the blocks we compute checksums for.  */
> +#define MATCH_BLOCKSIZE 64 

Add some rationale for 64 please, even "64 is a guess" or an explicit
"64 is what monotone uses" would do.

> +
> +
> +/* Compute a delta from A to B using xdelta.  
> +
> +The basic xdelta algorithm is as follows:
> +
> +1. Go through the source data, checksumming every BLOCKSIZE block of bytes

MATCH_BLOCKSIZE

> +   using adler32,   and inserting the checksum into a match table with the
> +   position of the match.
> +2. Go through the target byte by byte, seeing if that byte starts a match that
> +   we have in the match table.
> +   2a. If so, try to extend the match as far as possible both forwards and
> +   backwards, and then insert a source copy operation into the delta ops
> +   builder for the match.
> +   2b. If not, insert the byte as new data using an insert delta op.  
> +
> +Our implementation doesn't immediately insert "insert" operations, it waits
> +until we have another copy, or we are done.  The reasoning is twofold:
> +
> +1. Otherwise, we would just be building a ton of 1 byte insert operations
> +2. So that we can extend a source match backwards into a pending insert
> +operation, and possibly remove the need for the insert entirely.  This can
> +happen due to stream alignment.

-- 
Philip Martin

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org