Posted to dev@subversion.apache.org by Jim Blandy <ji...@zwingli.cygnus.com> on 2000/11/13 22:37:03 UTC

random access delta streams

I'm thinking about storing really, really large objects in the
Subversion repository.  How hard would it be to implement seeking on
the result of applying a seekable text delta to some seekable stream?


Roughly:


/* A type representing a big, seekable, read-only stream.  */
typedef struct seekable_t seekable_t;


/* Read LEN bytes of data at offset POS in STREAM, and store them in
   BUF.  */
svn_error_t *seekable_read (seekable_t *stream,
                            apr_off_t pos,
                            char *buf,
                            apr_size_t len);


/* Set *TARGET_P to a seekable stream whose contents are the result of
   applying the svndelta stream DELTA to the base stream SOURCE.  */
svn_error_t *seekable_apply_delta (seekable_t **target_p,
                                   seekable_t *source,
                                   seekable_t *delta);



So if the Subversion filesystem has some 2-gigabyte file, with one
revision in fulltext and others represented as chains of deltas, I
could start with the fulltext and chain together stuff with
seekable_apply_delta, and actually get random access.
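
As a sketch of that chain (open_fulltext and open_delta are made-up
helpers standing in for whatever the filesystem layer provides; SVN_ERR
is just the usual error-propagation shorthand):

/* A sketch, not a proposal: reconstruct a seekable view of revision REV
   by stacking the stored deltas on top of the stored fulltext.  */
svn_error_t *
open_revision (seekable_t **result_p, int rev)
{
  seekable_t *current;
  int i;

  /* Start from the revision that's stored in fulltext (say, rev 0).  */
  SVN_ERR (open_fulltext (&current, 0));

  /* Layer each stored delta on top; nothing is expanded into a
     temporary file.  */
  for (i = 1; i <= rev; i++)
    {
      seekable_t *delta, *combined;

      SVN_ERR (open_delta (&delta, i));
      SVN_ERR (seekable_apply_delta (&combined, current, delta));
      current = combined;
    }

  *result_p = current;
  return SVN_NO_ERROR;
}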

Is this even practical?  Do I (or, does Greg S.) really have to spew
the whole 2 gig out into a temporary file to provide random access to
these things?

Re: random access delta streams

Posted by Greg Stein <gs...@lyra.org>.
I'm not truly concerned with old revisions so much as random access on the
HEAD. The Reader is going to have a rather difficult time, in any case,
finding the right URL for older revisions.

Sorry, I realize that I confused my "today" recommendation with the
seekable-delta stuff. So to restate: "seekable revisions (old or new) are
great/necessary; efficient seeking of HEAD is probably a MUST; efficient
seeking of prior revisions is a MAY or SHOULD."

Cheers,
-g

On Tue, Nov 14, 2000 at 11:14:33AM -0600, Karl Fogel wrote:
> I think I agree with Branko on this one.
> 
> Certainly, it is true that if we implement random access on old files
> in a groovy way, then you can point your Acrobat Reader (or whatever)
> at an old version of a huge doc and it will be retrieved efficiently.
> 
> But if we don't spend the time on the grooviness right now, you will
> still be able to point your Acrobat Reader and read the old doc, it
> just won't be retrieved as quickly.
> 
> There's nothing preventing us making the change later, if it turns out
> to be needed.  It would be premature optimization to spend a lot of
> time on it now (unless Jim thinks it's not any significant extra work;
> but it sounds like it is, so I'm going on that assumption).  Let's
> just get things working first.
> 
> -K
> 
> Greg Stein <gs...@lyra.org> writes:
> > On Tue, Nov 14, 2000 at 01:52:16PM -0500, Greg Hudson wrote:
> > > > Acrobat Reader. Today. I'd suggest the time is right :-)
> > > 
> > > Let's have a little perspective here.  Being able to efficiently run
> > > Adobe Acrobat Reader directly against versions of files in the
> > > repository is a good thing, but it's not a goal for Subversion 1.0.
> > 
> > Perspective? I gave a concrete example, not an exhaustive list.
> > 
> > And I hope that people *have* realized that a person can just point their
> > web browser at a Subversion repository and see the HEAD branch. Hell, you
> > can apply version control to your web site, and users can continue to just
> > browse through it like they always do. And that browsing could certainly
> > include the stored PDF documents.
> > 
> > Cheers,
> > -g
> > 
> > -- 
> > Greg Stein, http://www.lyra.org/

-- 
Greg Stein, http://www.lyra.org/

Re: random access delta streams

Posted by Karl Fogel <kf...@galois.collab.net>.
I think I agree with Branko on this one.

Certainly, it is true that if we implement random access on old files
in a groovy way, then you can point your Acrobat Reader (or whatever)
at an old version of a huge doc and it will be retrieved efficiently.

But if we don't spend the time on the grooviness right now, you will
still be able to point your Acrobat Reader and read the old doc, it
just won't be retrieved as quickly.

There's nothing preventing us making the change later, if it turns out
to be needed.  It would be premature optimization to spend a lot of
time on it now (unless Jim thinks it's not any significant extra work;
but it sounds like it is, so I'm going on that assumption).  Let's
just get things working first.

-K

Greg Stein <gs...@lyra.org> writes:
> On Tue, Nov 14, 2000 at 01:52:16PM -0500, Greg Hudson wrote:
> > > Acrobat Reader. Today. I'd suggest the time is right :-)
> > 
> > Let's have a little perspective here.  Being able to efficiently run
> > Adobe Acrobat Reader directly against versions of files in the
> > repository is a good thing, but it's not a goal for Subversion 1.0.
> 
> Perspective? I gave a concrete example, not an exhaustive list.
> 
> And I hope that people *have* realized that a person can just point their
> web browser at a Subversion repository and see the HEAD branch. Hell, you
> can apply version control to your web site, and users can continue to just
> browse through it like they always do. And that browsing could certainly
> include the stored PDF documents.
> 
> Cheers,
> -g
> 
> -- 
> Greg Stein, http://www.lyra.org/

Re: random access delta streams

Posted by Greg Stein <gs...@lyra.org>.
On Tue, Nov 14, 2000 at 01:52:16PM -0500, Greg Hudson wrote:
> > Acrobat Reader. Today. I'd suggest the time is right :-)
> 
> Let's have a little perspective here.  Being able to efficiently run
> Adobe Acrobat Reader directly against versions of files in the
> repository is a good thing, but it's not a goal for Subversion 1.0.

Perspective? I gave a concrete example, not an exhaustive list.

And I hope that people *have* realized that a person can just point their
web browser at a Subversion repository and see the HEAD branch. Hell, you
can apply version control to your web site, and users can continue to just
browse through it like they always do. And that browsing could certainly
include the stored PDF documents.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

Re: random access delta streams

Posted by Greg Stein <gs...@lyra.org>.
On Tue, Nov 14, 2000 at 07:05:21PM +0100, Branko Cibej wrote:
> Jim Blandy wrote:
>...
> >> Another question: when do you need random access to old versions?
> >> And if you do, wouldn't it make more sense to store them in the
> >> clear, anyway, without having to apply a series of deltas every
> >> time? That's what any caching scheme will boil down to, anyway.
> > 
> > Greg S. asked some mild questions, and I'm inferring from there.
> > The thought arose from the intersection of these issues:
> > - Subversion wants to use deltas to store multiple versions of files
> >   in a compact way.
> > - Subversion wants to provide a uniform interface for accessing both
> >   old and new versions of nodes.
> > - HTTP or DAV has some way to ask about substrings of files.
> 
> Oooh, that's right: HTTP/1.1 lets you pick a file to pieces.

Concretely: the Adobe Acrobat (PDF) Reader plugin will use ranges when it
detects a slow line. It will grab everything for the first page, then go
back and start fetching the rest of the document. [fast lines just suck in
the whole doc]
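
For reference, that first-page fetch is an ordinary HTTP/1.1 byterange
exchange, roughly like this (URL and offsets invented):

GET /repos/docs/manual.pdf HTTP/1.1
Host: svn.example.com
Range: bytes=0-65535

HTTP/1.1 206 Partial Content
Content-Range: bytes 0-65535/2147483648
Content-Length: 65536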

The subrange request also happens to apply to PUT, but that would probably
apply to a "working resource" (no delta, no compression). And/or we can
disallow subrange PUT in mod_dav_svn. I'm fine with deferring this one for a
bit (although the DAV filesystem in MacOS X might do subrange PUT, but it
won't really work against SVN since we don't have DeltaV "auto-versioning").

Regarding large files: the Oracle folks are doing a backend for mod_dav 1.0
specifically so they can store large media files. The work is coming from
their Intermedia group, and they're looking at multi-megabyte (even
gigabyte) documents. (and yes, they also plan to do versioning with those
buggers) With documents of that size, subrange fetching is very important.

> > I have no compelling application in mind; I'm just trying to meet all
> > the promises we're making.
> 
> Well, personally I think it would be a great feature to have, and I've a 
> hunch that it could be a huge performance gain in some situations. I've 
> no feeling about /when/ we'll need it. I suggest you add this to IDEAS, 
> and we can pull it out of there when the time's right.

Acrobat Reader. Today. I'd suggest the time is right :-)

[ interesting note: we've been using the Reader as a test case for Apache
  2.0's updated byterange handling mechanisms ]

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

Re: random access delta streams

Posted by Branko Čibej <br...@xbc.nu>.
Jim Blandy wrote:

>> It should be possible, I guess. Right now, our delta format doesn't
>> allow such random-access applying of deltas simply because the windows
>> aren't indexed. You'd have to store a window lookup table with the
>> delta in order to convert seek offsets to the appropriate windows.
> 
> 
> Yep.  Just on the techno-wanking side, I was thinking of a
> hierarchical index: after every tenth window, store a little table of
> pointers to the windows.  After every tenth little table, store a
> little table of pointers to little tables of pointers to windows.
> After every tenth... you see what I mean.  Then you can start at the
> end of the file and find any window in logarithmic time.  That
> procedure always emits balanced trees.  Use 100 instead of ten, and
> your tree has 1/100th the number of nodes as your file has windows,
> and is very shallow.

Aha, you're describing a B+tree of delta windows. Talk about a database 
of compressed databases. :-)

>> Another question: when do you need random access to old versions?
>> And if you do, wouldn't it make more sense to store them in the
>> clear, anyway, without having to apply a series of deltas every
>> time? That's what any caching scheme will boil down to, anyway.
> 
> 
> Greg S. asked some mild questions, and I'm inferring from there.
> The thought arose from the intersection of these issues:
> - Subversion wants to use deltas to store multiple versions of files
>   in a compact way.
> - Subversion wants to provide a uniform interface for accessing both
>   old and new versions of nodes.
> - HTTP or DAV has some way to ask about substrings of files.

Oooh, that's right: HTTP/1.1 lets you pick a file to pieces.

> I have no compelling application in mind; I'm just trying to meet all
> the promises we're making.

Well, personally I think it would be a great feature to have, and I've a 
hunch that it could be a huge performance gain in some situations. I've 
no feeling about /when/ we'll need it. I suggest you add this to IDEAS, 
and we can pull it out of there when the time's right.

-- 
Brane Čibej
    home:   <br...@xbc.nu>             http://www.xbc.nu/brane/
    work:   <br...@hermes.si>   http://www.hermes-softlab.com/
     ACM:   <br...@acm.org>            http://www.acm.org/

Re: random access delta streams

Posted by Jim Blandy <ji...@zwingli.cygnus.com>.
> It should be possible, I guess. Right now, our delta format doesn't
> allow such random-access applying of deltas simply because the windows
> aren't indexed. You'd have to store a window lookup table with the
> delta in order to convert seek offsets to the appropriate windows.

Yep.  Just on the techno-wanking side, I was thinking of a
hierarchical index: after every tenth window, store a little table of
pointers to the windows.  After every tenth little table, store a
little table of pointers to little tables of pointers to windows.
After every tenth... you see what I mean.  Then you can start at the
end of the file and find any window in logarithmic time.  That
procedure always emits balanced trees.  Use 100 instead of ten, and
your tree has 1/100th the number of nodes as your file has windows,
and is very shallow.
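
A rough sketch of the lookup half of that scheme, using an in-memory
stand-in for the on-disk tables (none of these types exist in the delta
format today; the layout is invented purely for illustration):

/* Hypothetical index node: either a leaf pointing at a delta window,
   or an interior table of child nodes.  Each node records the first
   target offset its subtree reconstructs.  */
typedef struct index_node_t index_node_t;
struct index_node_t
{
  apr_off_t first_offset;        /* first target byte covered by this subtree */
  int is_window;                 /* leaf (a window) or interior (a table)? */
  struct delta_window_t *window; /* valid when is_window */
  index_node_t *children;        /* valid when !is_window ... */
  int nchildren;                 /* ... along with the child count */
};

/* Starting from the top-level table (stored at the end of the delta),
   descend to the window that reconstructs target offset POS.  Each step
   discards all but one child table, so the search is logarithmic in the
   number of windows.  */
static struct delta_window_t *
find_window (index_node_t *node, apr_off_t pos)
{
  while (! node->is_window)
    {
      int i = node->nchildren - 1;

      /* Take the last child whose range starts at or before POS.  */
      while (i > 0 && node->children[i].first_offset > pos)
        i--;
      node = &node->children[i];
    }
  return node->window;
}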

You're right that locality is a big concern --- I wonder how much
locality this will have.

> I'm wondering, though: when would you want to put such a large file
> under revision control? Typically it'll be a database, which has 
> its own consistency and versioning mechanism, or a media file
> (e.g., video), which is streamy by nature. I'm not saying it doesn't
> make sense to store large files, I only think that the applications
> for that are relatively few.
>
> Another question: when do you need random access to old versions?
> And if you do, wouldn't it make more sense to store them in the
> clear, anyway, without having to apply a series of deltas every
> time? That's what any caching scheme will boil down to, anyway.

Greg S. asked some mild questions, and I'm inferring from there.
The thought arose from the intersection of these issues:
- Subversion wants to use deltas to store multiple versions of files
  in a compact way.
- Subversion wants to provide a uniform interface for accessing both
  old and new versions of nodes.
- HTTP or DAV has some way to ask about substrings of files.

I have no compelling application in mind; I'm just trying to meet all
the promises we're making.

Re: random access delta streams

Posted by Branko Čibej <br...@hermes.si>.
Jim Blandy wrote:
> So if the Subversion filesystem has some 2-gigabyte file, with one
> revision in fulltext and others represented as chains of deltas, I
> could start with the fulltext and chain together stuff with
> seekable_apply_delta, and actually get random access.
> 
> Is this even practical?  Do I (or, does Greg S.) really have to spew
> the whole 2 gig out into a temporary file to provide random access to
> these things?

It should be possible, I guess. Right now, our delta format doesn't
allow such random-access applying of deltas simply because the windows
aren't indexed. You'd have to store a window lookup table with the
delta in order to convert seek offsets to the appropriate windows.
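
A minimal sketch of what such a lookup table might look like (the entry
layout and names here are invented, not part of the current svndelta
format):

/* One entry per window: the first target offset the window produces,
   and where the window itself starts inside the delta.  */
typedef struct window_index_entry_t
{
  apr_off_t target_offset;   /* first reconstructed byte this window covers */
  apr_off_t window_offset;   /* where the window starts within the delta */
} window_index_entry_t;

/* Binary-search the table for the window covering target offset POS.
   INDEX holds NENTRIES entries sorted by target_offset.  */
static const window_index_entry_t *
lookup_window (const window_index_entry_t *index, int nentries, apr_off_t pos)
{
  int lo = 0, hi = nentries - 1;

  while (lo < hi)
    {
      int mid = (lo + hi + 1) / 2;

      if (index[mid].target_offset <= pos)
        lo = mid;        /* this window starts at or before POS */
      else
        hi = mid - 1;    /* this window starts after POS */
    }
  return &index[lo];
}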

But don't forget that a window can describe differences relative to the
existing target data, not the source. A delta used purely for
compression would be a stream of such windows, so when seeking you'd
have to decompress from the beginning of the file up to the seek
offset. (SVN doesn't create such deltas now, although it would take
only a small change to do so.)

I'm wondering, though: when would you want to put such a large file
under revision control? Typically it'll be a database, which has 
its own consistency and versioning mechanism, or a media file
(e.g., video), which is streamy by nature. I'm not saying it doesn't
make sense to store large files, I only think that the applications
for that are relatively few.

Another question: when do you need random access to old versions?
And if you do, wouldn't it make more sense to store them in the
clear, anyway, without having to apply a series of deltas every
time? That's what any caching scheme will boil down to, anyway.

-- 
Branko Čibej                 <br...@hermes.si>
HERMES SoftLab, Litijska 51, 1000 Ljubljana, Slovenia
voice: (+386 1) 586 53 49     fax: (+386 1) 586 52 70