Posted to dev@subversion.apache.org by Greg Hudson <gh...@mit.edu> on 2000/08/09 18:27:29 UTC

vdelta questions

First: what's up with our mailing list archives?  They seem to stop at
June 15.

I've been wrapping my head around the vdelta algorithm as described in
"Delta Algorithms: An Empirical Analysis".  I've done a very basic
implementation of the core block-move generation algorithm and have
run into an ambiguity.  Since at least one other developer stated an
intention to work on the binary diff task, I'm hoping someone else has
read up on the algorithm and might have an interpretation.

vdelta maintains a hash table of indexes keyed by the four bytes
starting at that index.  Do we think this hash table is supposed to be
chained, so that it preserves all indexes which have been entered, or
do you just replace an entry in the hash table if there was already
something there?  (Either because of a hash collision or because
you're entering an index with the same four-byte sequence as a
previous entry.)  The description of how comparisons are done does not
mention trying multiple possible matches and seeing which one yields
the longest match, but the last diagram in the example shows a "*" at
position 0 and 13 (both of which have "bcde" as the next four bytes),
and the text following the example says, "Note that the above matching
algorithm will actually find the longest match if indexes are kept for
every location in the string," which would only be true if multiple
match candidates were tried at each comparison.

This is purely an issue of delta performance.  vdelta is supposed to
generate diffs about as small as bdiff's and run much faster, and I'd
like to have a good guess as to whether that level of performance (both
in terms of runtime and diff size) can be obtained with or without
trying multiple matches.  Obviously, it's much easier to implement the
algorithm if you simply store the most recent index at each hash table
location; the hash table becomes a trivial data structure and you only
have to check one match candidate at each byte position.
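The simple interpretation described above can be sketched roughly as
follows. This is only an illustration of the one-slot idea, not the
vdelta algorithm or its encoding; the hash function, table size, and
all names here are made up for the example:

```python
# Sketch of the "store only the most recent index" interpretation:
# one index per hash slot, one match candidate per byte position.
# HASH_SIZE and hash4 are illustrative, not from the paper.

HASH_SIZE = 1 << 16

def hash4(data, i):
    """Hash the four bytes starting at index i."""
    return (data[i] * 33**3 + data[i+1] * 33**2
            + data[i+2] * 33 + data[i+3]) % HASH_SIZE

def find_matches(data):
    """Return (position, match_start, match_length) block moves,
    keeping only the most recent index in each hash slot."""
    table = [None] * HASH_SIZE
    moves = []
    i = 0
    while i + 4 <= len(data):
        h = hash4(data, i)
        cand = table[h]
        table[h] = i  # overwrite: the most recent index wins
        # Verify the bytes, since a slot hit may be a hash collision.
        if cand is not None and data[cand:cand+4] == data[i:i+4]:
            # Extend the single candidate match as far as it goes.
            n = 4
            while i + n < len(data) and data[cand+n] == data[i+n]:
                n += 1
            moves.append((i, cand, n))
            i += n
        else:
            i += 1
    return moves
```

With chaining disabled like this, only one candidate is ever tried per
position, which is the trivial-data-structure case described above.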

(One option is to go out and buy the _Practical Unix Software_ book,
which allegedly discusses vdelta according to the vcdiff draft, and
might contain an implementation I could refer to... although that
might have bad intellectual property consequences.  Does anyone have
this book, and can tell me how much vdelta stuff is in there?  Another
option is to ask Dave Korn, but I'd rather not take up his time before
exhausting the other options.)

Thanks.

Re: vdelta questions

Posted by Karl Fogel <kf...@galois.collab.net>.
Branko Čibej <br...@hermes.si> writes:
> (That means we have two implementations right now -- yours and Karl's.
> And mine is on the way, too ...)

Mine is a very bare proof-of-concept and is far from being a working
vdelta generator.  I wouldn't pay any attention to it.

Re: vdelta questions

Posted by Branko Čibej <br...@hermes.si>.
Jim Blandy wrote:
> According to the VCDIFF draft:
> 
>     However, even with a memory-efficient and fast string matching
>     algorithm, the computing resource requirements can be prohibitive
>     for processing large files. The standard way to deal with this is
>     to partition input files into "windows" to be processed
>     separately. Here, except for some unpublished work by Vo, little
>     has been done on designing effective windowing schemes.  Current
>     techniques, including Vdelta, simply use windows of the same size
>     with corresponding addresses across source and target files.

This is the scheme I'm using right now. With large enough windows,
it shouldn't be such a problem.
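The same-size, corresponding-address scheme the draft describes can be
sketched like this (Python, with illustrative names; real code would
stream the windows rather than build them all in memory):

```python
def parallel_windows(source, target, window_size):
    """Split source and target into windows of the same size at
    corresponding offsets; each pair would then be delta-compressed
    independently.  Illustrative sketch only."""
    n = max(len(source), len(target))
    pairs = []
    for off in range(0, n, window_size):
        # Slicing past the end just yields a short (or empty) window.
        pairs.append((source[off:off+window_size],
                      target[off:off+window_size]))
    return pairs
```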

>     String matching and windowing algorithms have large influence on
>     the compression rate of delta and compressed files.
> 
> So, you've got your work cut out for you. :)

Thanks. Here we are, on the bleeding edge of technology ...

    Brane

-- 
Branko Čibej                 <br...@hermes.si>
HERMES SoftLab, Litijska 51, 1000 Ljubljana, Slovenia
voice: (+386 1) 586 53 49     fax: (+386 1) 586 52 70

Re: vdelta questions

Posted by Jim Blandy <ji...@savonarola.red-bean.com>.
> I've been worrying about something else, though. vdelta expects to
> hash the whole source before starting to generate the delta. So where
> does windowing come into that? Do you take parallel sections of
> both source and target? Oh ... that would be O.K. for small differences,
> and for large unrelated files you'd just get "noisy" compression, right?

According to the VCDIFF draft:

    However, even with a memory-efficient and fast string matching
    algorithm, the computing resource requirements can be prohibitive
    for processing large files. The standard way to deal with this is
    to partition input files into "windows" to be processed
    separately. Here, except for some unpublished work by Vo, little
    has been done on designing effective windowing schemes.  Current
    techniques, including Vdelta, simply use windows of the same size
    with corresponding addresses across source and target files.

    String matching and windowing algorithms have large influence on
    the compression rate of delta and compressed files.

So, you've got your work cut out for you. :)

Re: vdelta questions

Posted by Branko Čibej <br...@hermes.si>.
Greg Hudson wrote:
> I've done a very basic
> implementation of the core block-move generation algorithm and have
> run into an ambiguity.

(That means we have two implementations right now -- yours and Karl's.
And mine is on the way, too ...)

> vdelta maintains a hash table of indexes keyed by the four bytes
> starting at that index.  Do we think this hash table is supposed to be
> chained, so that it preserves all indexes which have been entered, or
> do you just replace an entry in the hash table if there was already
> something there?  (Either because of a hash collision or because
> you're entering an index with the same four-byte sequence as a
> previous entry.)  The description of how comparisons are done does not
> mention trying multiple possible matches and seeing which one yields
> the longest match, but the last diagram in the example shows a "*" at
> position 0 and 13 (both of which have "bcde" as the next four bytes),
> and the text following the example says, "Note that the above matching
> algorithm will actually find the longest match if indexes are kept for
> every location in the string," which would only be true if multiple
> match candidates were tried at each comparison.

I think this quote from the paper is relevant:

   "For efficiency, vdelta relaxes the greedy parsing rule so that
    matching prefixes are not always maximally long."

That seems to mean that a fully associative hash isn't required.
Also, the only place where an insertion into the hash table could
reuse an existing key is step (2)(a) of the processing pseudo-code.
My guess is that you're supposed to allow for hash collisions, but
in the case of a key collision, replace the old entry with the
newer index.
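That reading (chain entries that merely hash-collide, replace on a
true key collision) could be sketched like this. Everything here,
including `hash4` and the bucket layout, is illustrative, not taken
from the paper:

```python
def insert_index(table, data, i, hash4):
    """Chain entries whose four-byte keys differ but hash alike;
    replace an entry whose key is identical with the newer index.
    table maps hash value -> list of indexes (a bucket)."""
    bucket = table.setdefault(hash4(data, i), [])
    key = bytes(data[i:i+4])
    for slot, j in enumerate(bucket):
        if bytes(data[j:j+4]) == key:
            bucket[slot] = i   # same key: the newer index wins
            return
    bucket.append(i)           # mere hash collision: chain it
```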

> This is purely an issue of delta performance.

Right.

> vdelta is supposed to
> generate diffs about as small as bdiff's and run much faster, and I'd
> like to have a good guess as to whether that level of performance (both
> in terms of runtime and diff size) can be obtained with or without
> trying multiple matches.  Obviously, it's much easier to implement the
> algorithm if you simply store the most recent index at each hash table
> location; the hash table becomes a trivial data structure and you only
> have to check one match candidate at each byte position.

Oh, I think it wouldn't be too hard to try all three possibilities,
then compare the results (for performance, memory consumption and
delta size).


I've been worrying about something else, though. vdelta expects to
hash the whole source before starting to generate the delta. So where
does windowing come into that? Do you take parallel sections of
both source and target? Oh ... that would be O.K. for small differences,
and for large unrelated files you'd just get "noisy" compression, right?


    Brane

-- 
Branko Čibej                 <br...@hermes.si>
HERMES SoftLab, Litijska 51, 1000 Ljubljana, Slovenia
voice: (+386 1) 586 53 49     fax: (+386 1) 586 52 70

Re: vdelta questions

Posted by Karl Fogel <kf...@galois.collab.net>.
Karl Fogel <kf...@galois.collab.net> writes:
>    1.  Overwrite on every collision, or
>    2.  Keep most recent N matches (i.e., limit hash chain length to N), or 
>    3.  Keep all matches.

Hmm, I forgot option 1.5:

    1.5. Keep [one match|N matches] for every unique four-byte
         sequence seen, regardless of whether it hash-collides with
         other four-byte seqs.

This one might be good because at least it starts out the match
extender on the right foot, is easy to store tree-ishly for fast
lookup, and still keeps an upper bound on the size of the hash table.
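Option 1.5 might look something like the following sketch, with a
plain dict keyed by the four bytes themselves standing in for the
tree-ish structure (all names illustrative):

```python
def build_index(data, max_chain=1):
    """Keep up to max_chain recent indexes per unique four-byte
    sequence.  Keying by the bytes themselves means different
    sequences can never collide with each other."""
    index = {}
    for i in range(len(data) - 3):
        key = bytes(data[i:i+4])
        chain = index.setdefault(key, [])
        chain.append(i)
        if len(chain) > max_chain:
            chain.pop(0)  # drop the oldest index for this sequence
    return index
```

With max_chain=1 this is the "keep one match per unique sequence"
variant; raising it gives the bracketed "N matches" variant.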

Just a quick thought, ignore as necessary.

-K

Re: vdelta questions

Posted by Karl Fogel <kf...@galois.collab.net>.
Greg Hudson <gh...@mit.edu> writes:
> vdelta maintains a hash table of indexes keyed by the four bytes
> starting at that index.  Do we think this hash table is supposed to be
> chained, so that it preserves all indexes which have been entered, or
> do you just replace an entry in the hash table if there was already
> something there?  (Either because of a hash collision or because
> you're entering an index with the same four-byte sequence as a
> previous entry.)  The description of how comparisons are done does not
> mention trying multiple possible matches and seeing which one yields
> the longest match, but the last diagram in the example shows a "*" at
> position 0 and 13 (both of which have "bcde" as the next four bytes),
> and the text following the example says, "Note that the above matching
> algorithm will actually find the longest match if indexes are kept for
> every location in the string," which would only be true if multiple
> match candidates were tried at each comparison.

I don't know what the paper recommends, but it seems to me that you
have your choice of how much collided data to preserve.  You could 

   1.  Overwrite on every collision, or
   2.  Keep most recent N matches (i.e., limit hash chain length to N), or 
   3.  Keep all matches.

Clearly, options 2 and 3 imply trying at least some of the available
matches to find the longest -- no point in keeping that data if we're
not going to use it.  One could always try all available matches, or
else have some "good enough" heuristic to decide when enough have been
tried.
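Options 2 and 3 both reduce to trying each retained candidate and
keeping the longest. A rough sketch of that inner loop (Python,
illustrative names; candidates are assumed to be earlier positions
whose four bytes already matched):

```python
def longest_match(data, pos, candidates):
    """Try every stored candidate index and return the
    (start, length) pair giving the longest match at pos."""
    best = (None, 0)
    for cand in candidates:
        n = 0
        while (pos + n < len(data)
               and data[cand + n] == data[pos + n]):
            n += 1
        if n > best[1]:
            best = (cand, n)
    return best
```

A "good enough" heuristic would just add an early return once n
exceeds some threshold, instead of exhausting the candidate list.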

Oh.  As I read the rest of your mail, I see that you're already well
aware of the tradeoffs summarized above. :-)

Why not implement the simplest, fastest case first (option 1), since
it's pretty much a subset of cases 2 and 3, and see what kind of delta
size you get?  With a good test suite, comparing performance as you
try 2 and/or 3 should be easy.

> This is purely an issue of delta performance.  vdelta is supposed to
> generate diffs about as small as bdiff's and run much faster, and I'd
> like to have a good guess as to whether that level of performance (both
> in terms of runtime and diff size) can be obtained with or without
> trying multiple matches.  Obviously, it's much easier to implement the
> algorithm if you simply store the most recent index at each hash table
> location; the hash table becomes a trivial data structure and you only
> have to check one match candidate at each byte position.
> 
> (One option is to go out and buy the _Practical Unix Software_ book,
> which allegedly discusses vdelta according to the vcdiff draft, and
> might contain an implementation I could refer to... although that
> might have bad intellectual property consequences.  Does anyone have
> this book, and can tell me how much vdelta stuff is in there?  Another
> option is to ask Dave Korn, but I'd rather not take up his time before
> exhausting the other options.)

As long as we don't use his actual code, I think we're okay.  As far as
I know, there are no patent issues here (*shudder*).

-Karl

Re: vdelta questions

Posted by Brian Behlendorf <br...@collab.net>.
On Wed, 9 Aug 2000, Greg Hudson wrote:
> First: what's up with our mailing list archives?  They seem to stop at
> June 15.

Doh; we'll look into it.

	Brian