You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by Branko Čibej <br...@hermes.si> on 2000/08/09 17:49:50 UTC

[patch] First stab at text deltas

I wanted to get some hands-on experience with the text delta interface,
so I've gone and done a very primitive implementation, and put in a
test driver. I haven't touched vcdiff here yet, nor vdelta (although
we _do_ generate valid deltas, as you'll see :-).

I made a small change to the interface -- we need an apr_pool_t* in
svn_delta_window_t. Right now it's used to allocate the window and
dependent structures, and its a sub-pool of the pool used by the
delta stream.

I'm still thinking about how to structure this stuff, so please don't
be too picky. (This isn't even the foundation yet -- rather, it's the
hole into which the foundation will be laid. :-)

    Brane

-- 
Branko Čibej                 <br...@hermes.si>
HERMES SoftLab, Litijska 51, 1000 Ljubljana, Slovenia
voice: (+386 1) 586 53 49     fax: (+386 1) 586 52 70

Re: [patch] First stab at text deltas

Posted by Jim Blandy <ji...@savonarola.red-bean.com>.
Actually, the arguments in my previous message apply to consuming
deltas, but not so much to producing them (which is what I assume
you've been working on).  Computing a delta requires two separate
incoming data streams, which is pretty impossible to do in a
caller-pushes fashion.

I think only the interface to svn_vcdiff_to_delta needs to change.
svn_delta_to_vcdiff is more useful for transmission, so it's better
for it to be caller-pulls anyway.

Re: [patch] First stab at text deltas

Posted by Branko Čibej <br...@hermes.si>.
Jim Blandy wrote:
> Perhaps this is more explanation than needed, but I didn't want you to
> feel like we were just jerking you around, and changing interfaces by
> whim.

That's O.K. I'm working on the core stuff right now, anyway.
Besides, I'll be away for a week now, and I _expect_ things to 
change drastically in so long a time :-)

    Brane


-- 
Branko Čibej                 <br...@hermes.si>
HERMES SoftLab, Litijska 51, 1000 Ljubljana, Slovenia
voice: (+386 1) 586 53 49     fax: (+386 1) 586 52 70

Re: [patch] First stab at text deltas

Posted by Jim Blandy <ji...@savonarola.red-bean.com>.
We recognized yesterday that the text delta interface needs to be
turned upside down.  Here's the general issue:

When you're writing single-threaded C code to handle data in a
stream-like fashion, there are two approaches you can take:

- `caller pulls': the function that consumes the data calls the
  function that produces the data.  The producer passes the data to
  the consumer via return values, by modifying parameters passed by
  reference, etc.  The current text delta interface is in this style,
  since one calls `svn_next_delta_window' to get the next chunk of
  instructions.  But `getchar' and `fread' are simpler examples.

- `caller pushes': the function that produces the data calls the
  function that consumes the data.  The producer passes the data to
  the consumer via ordinary parameters.  The Expat XML parser
  interface is in this style: you call XML_Parse with a chunk of text
  to parse, and it invokes your callbacks on each significant thing it
  finds.

They both work, but they don't mix well: you can't take a `caller
pushes' component and use it to produce data for a `caller pulls'
component, without giving up your streaming behavior.

For example, suppose we're using Expat to parse an XML stream that
containins text delta data.  (Contrived, I know.)  The *tree* delta
parser invokes Expat, which finds some vcdiff data, and invokes its
character data handler, which wants to call the *text* delta parser.
However, it can't provide an appropriate read_fn, because any further
data that function may need is still in Expat's internal buffer.  And
you can't call Expat recursively, because Expat may find the end of
the vcdiff data, and want to invoke its StartElementHandler and
EndElementHandler, even though the tree delta parser is still waiting
for its character data handler to return.  ("Don't call us, we'll call
you.")

It could be made to work by having Expat run in a separate thread,
interacting with the text delta code in a coroutine-ish kind of way,
but making the Subversion delta library require thread support is a
lose.

It's not a huge problem: we just invert the text delta interface to be
`caller pushes' instead of `caller pulls'.  Event-driven and
`select'-driven environments are more friendly to pushing anyway, so
this is probably a good thing.

Perhaps this is more explanation than needed, but I didn't want you to
feel like we were just jerking you around, and changing interfaces by
whim.  If you see some clean way to make things work together, let me
know, but I think it's simplest just to change the text delta
interface.  Probably you'll be able to share a lot of code --- I hope
so, at least.

Re: [patch] First stab at text deltas

Posted by Branko Čibej <br...@hermes.si>.
Oh, BTW: Don't expect to hear from me until next Monday, when I'll
be back from vacations and ready to dig into vcdiff.

    Brane

Re: [patch] First stab at text deltas

Posted by Branko Čibej <br...@hermes.si>.
O.K., here's the second stab. We now have a working vdelta generator,
with windowing, ain't that nice!

    subversion/include/svn_delta.h:
        Add ops_size and pool to svn_delta_window_t.

    subversion/libsvn_delta/Makefile.am:
        Add delta_stream.c vdelta.c to libsvn_delta_la_SOURCES.
    subversion/libsvn_delta/delta_stream.h: New, private header.
        Define svn_delta_stream_t and friends.
    subversion/libsvn_delta/delta_stream.c: New.
        Implementation of the public text delta interface
        and parts of the private interface.
    subversion/libsvn_delta/vdelta.c: Core vdelta generator.

    subversion/libsvn_delta/tests/Makefile.am: New.
    subversion/libsvn_delta/tests/text-delta.c: New.
        Test driver for generating text deltas.


    Brane

-- 
Branko Čibej                 <br...@hermes.si>
HERMES SoftLab, Litijska 51, 1000 Ljubljana, Slovenia
voice: (+386 1) 586 53 49     fax: (+386 1) 586 52 70

Re: [patch] First stab at text deltas

Posted by Branko Čibej <br...@hermes.si>.
Jim Blandy wrote:
> Experimenting with windowing algorithms sounds like a lot of fun.
> Obviously, one wants to have the two windows slide along so as to have
> the greatest degree of resemblance between the source and target.
> Deltas for huge files with small local changes should still be small,
> but if your windowing is not ideal, they could still end up being
> proportional to the overall size of the file.

I tested this by comparing a Makefile with its Makefile.in. When the
window size is about 10% of either file, the delta is about 2.5 times
larger than the on created with a window larger than both files.
(Note that the delta sizes are just an approximation based on the
number and kind of ops in our internal representation; I'm not
creating vcdiffs yet.)

It's interesting to note that the larger delta was about 12% of the file
size, a bit larger than a `diff -n`. Of course, this isn't really a
representative test, anyway.

In any case, we'll be able to play around at least with window sizes
now, at least.

    Brane

-- 
Branko Čibej                 <br...@hermes.si>
HERMES SoftLab, Litijska 51, 1000 Ljubljana, Slovenia
voice: (+386 1) 586 53 49     fax: (+386 1) 586 52 70

Re: [patch] First stab at text deltas

Posted by Jim Blandy <ji...@savonarola.red-bean.com>.
> > I believe that svn_tdelta__winfo_t's data_offset is always equal to
> > the window's new->len. If you eliminate that variable, you can get
> > rid of svn_tdelta__winfo_t altogether.  That would simplify your code
> 
> Indeed it is, but that would mean exposing the implementation details
> of svn_string_t. If that's O.K., I'll do that an remove svn_tdelta__winfo_t.

svn_string_t isn't supposed to be opaque --- it's transparent.  You're
supposed to be able to go into a string and tweak things arbitrarily.


> In fact, I already _do_ have hash tables and other state ...  I've
> just now finished the core vdelta driver, along with the dumb
> windowing method. I need to clean up the code a bit, so I'll have a
> patch ready sometime tomorrow.

Great!

Experimenting with windowing algorithms sounds like a lot of fun.
Obviously, one wants to have the two windows slide along so as to have
the greatest degree of resemblance between the source and target.
Deltas for huge files with small local changes should still be small,
but if your windowing is not ideal, they could still end up being
proportional to the overall size of the file.

Re: [patch] First stab at text deltas

Posted by Branko Čibej <br...@hermes.si>.
Jim Blandy wrote:
> 
> The code looks good so far.
> 
> It's fine with me to move svn_tdelta__winfo_t's ops_space into
> svn_delta_window_t.  That field is useful to anyone wanting to operate
> on windows, which seems like something lots of code might need.

That'd be better, yes.

> Whenever I'm working with an array which might have more space
> allocated than used, I use `num_' to indicate the number of elements
> actually in use, and `_size' to indicate the allocated size (to
> suggest `sizeof').  If that convention appeals to you, you could call
> the field ops_size.  But maybe num_ vs. _size is a weak distinction,
> making a poor convention.

I try to avoid using "size" for arrays when I mean length. Sometimes I
use _count and _capacity. Anyway, you're the author of the interface,
so I'll adapt to your style.
 
> I believe that svn_tdelta__winfo_t's data_offset is always equal to
> the window's new->len. If you eliminate that variable, you can get
> rid of svn_tdelta__winfo_t altogether.  That would simplify your code

Indeed it is, but that would mean exposing the implementation details
of svn_string_t. If that's O.K., I'll do that an remove svn_tdelta__winfo_t.

> 
> a bit.  But eventually you'll have hash tables and other state anyway,
> so it may not be important.

In fact, I already _do_ have hash tables and other state ...
I've just now finished the core vdelta driver, along with the dumb
windowing method. I need to clean up the code a bit, so I'll have a
patch ready sometime tomorrow.

Then I'm off for a week.


> I sense that you have a lispy background.  :)

Not really. I first saw this kind of syntax a long time ago when
I was playing around with TeX's virtual font descriptions, and hadn't
even heard of lisp. Sorry to disappoint you :-)
(Today, of course, I regularly hack at my .emacs file).


    Brane

-- 
Branko Čibej                 <br...@hermes.si>
HERMES SoftLab, Litijska 51, 1000 Ljubljana, Slovenia
voice: (+386 1) 586 53 49     fax: (+386 1) 586 52 70

Re: [patch] First stab at text deltas

Posted by Jim Blandy <ji...@savonarola.red-bean.com>.
The code looks good so far.  

It's fine with me to move svn_tdelta__winfo_t's ops_space into
svn_delta_window_t.  That field is useful to anyone wanting to operate
on windows, which seems like something lots of code might need.

Whenever I'm working with an array which might have more space
allocated than used, I use `num_' to indicate the number of elements
actually in use, and `_size' to indicate the allocated size (to
suggest `sizeof').  If that convention appeals to you, you could call
the field ops_size.  But maybe num_ vs. _size is a weak distinction,
making a poor convention.

I believe that svn_tdelta__winfo_t's data_offset is always equal to
the window's new->len.  If you eliminate that variable, you can get
rid of svn_tdelta__winfo_t altogether.  That would simplify your code
a bit.  But eventually you'll have hash tables and other state anyway,
so it may not be important.

I sense that you have a lispy background.  :)