Posted to dev@subversion.apache.org by kf...@collab.net on 2003/11/04 04:43:17 UTC

issue #1573: fs deltification causes delays

I'd like to discuss possible solutions to issue #1573.  From the
issue's description:

   If you add 5 bytes to a 256 meg file and commit, it takes many
   minutes for the svn_fs_merge() to return success, because it's
   deltifying the previous version of the file against the new
   version.

   Because this is happening as a 'builtin' part of a commit, it
   destroys svn's ability to commit changes to large files.  When
   operating over dav, neon times out waiting for the final 'MERGE'
   command to return success.  And for people using ra_svn, it's still
   not acceptable for users to wait many, many minutes for the commit
   to finish.

   The fact that the repository stores non-HEAD versions of files as
   deltas is an optimization (a deliberate space/time tradeoff) and an
   internal implementation.  We shouldn't be punishing users for this.

There are various proposed solutions in the issue.  But for now, I'd
like to talk just about solutions we can implement before 1.0 (i.e.,
before Beta, i.e., before 0.33 :-) ).  The two that seem most
realistic are:

   1. Prevent deltification on files over a certain size, but create
      some sort of out-of-band compression command -- something like
      'svnadmin deltify/compress/whatever' that a sysadmin or cron job
      can run during non-peak hours to reclaim disk space.

   2. Make svn_fs_merge() spawn a deltification thread (using APR
      threads) and return success immediately.  If the thread fails to
      deltify, it's not the end of the world: we simply don't get the
      disk-space savings.

(2) looks like a wonderful solution; the only thing I'm not sure of is
how to do it inside an Apache module.  Does anyone know?

I assume that (1) would involve a repository config option for the
file size.  Note also that we used to have an 'svnadmin deltify'
command and could easily get it back (see r3920), so (1) may not
actually be as much work as it looks like.  Those who don't want to
run the cron job would just set the size limit to infinity, and always
get deltification.
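
To make option (1) concrete, here is a minimal Python sketch of the size
check. It is not Subversion code: the names should_deltify_at_commit,
plan_commit, and max_deltify_size are made up for illustration, since all
that is proposed above is "a repository config option for the file size".
An infinite limit reproduces today's always-deltify behavior.

```python
import math


def should_deltify_at_commit(file_size, max_deltify_size=math.inf):
    """Return True if the commit itself should deltify a file of this size.

    max_deltify_size is a hypothetical per-repository setting, in bytes.
    math.inf means "always deltify in-process".
    """
    return file_size <= max_deltify_size


def plan_commit(changed_files, max_deltify_size=math.inf):
    """Split changed files into (deltify_now, deferred) lists of paths.

    changed_files maps path -> size in bytes.  Deferred paths are the ones
    an out-of-band job (cron, an 'svnadmin deltify'-style command) would
    have to pick up later.
    """
    deltify_now, deferred = [], []
    for path, size in sorted(changed_files.items()):
        if should_deltify_at_commit(size, max_deltify_size):
            deltify_now.append(path)
        else:
            deferred.append(path)
    return deltify_now, deferred
```

With a 1 MB limit, a 256 MB file would land in the deferred list while small
source files are still deltified during the commit.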

Insights, thoughts?

-Karl

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: issue #1573: fs deltification causes delays

Posted by Greg Stein <gs...@lyra.org>.
On Mon, Nov 03, 2003 at 10:43:17PM -0600, kfogel@collab.net wrote:
>...
>    2. Make svn_fs_merge() spawn a deltification thread (using APR
>       threads) and return success immediately.  If the thread fails to
>       deltify, it's not the end of the world: we simply don't get the
>       disk-space savings.
> 
> (2) looks like a wonderful solution; the only thing I'm not sure of is
> how to do it inside an Apache module.  Does anyone know?

"Don't do it"

If you spin up a thread, then return control to Apache, you could find the
process simply exiting out from underneath you. Needless to say, that
would be quite bad if it occurred within the BDB library...

Apache sometimes kills and restarts processes to clean up anything that it
might have missed in its normal set of cleanups. That will take down
threads, too, so you really want to avoid that.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/


Re: issue #1573: fs deltification causes delays

Posted by Branko Čibej <br...@xbc.nu>.
Jack Repenning wrote:

> There might be other tunable parameters as well.  For example, based
> solely on experiments like trying to zip an already-zipped file, I
> suspect that deltification of certain file types is both unusually
> expensive and unusually unproductive (zip files, for example).
> Encrypted files are even scarier, since it's an explicit goal of
> crypto that encrypting the exact same file twice must produce a
> completely dissimilar ciphertext.  I posit a class of files for which
> deltification is suboptimal (perhaps actually deleterious).  Who are
> the deltification gurus on the list?  Has this question been considered?

Yes, this question has been considered (at least by me), but up to now
there have been no satisfactory solutions. At first guess, we should
avoid deltifying files that are compressed and/or encrypted (this would
include all sorts of image formats, for example). We should also have
the option of just compressing ("deltify against empty source") files in
formats that are known to behave badly under deltification.

Basically, we should have three storage methods:

    * store: just store the new data in the repository
    * compress: compress the new data (vdelta or zlib -- whichever is
      "better", for some definition of better)
    * deltify: what we're doing now.

What to do about a particular file should be based on its type (MIME or
otherwise), and per-repository configuration. Obviously, this means that
automatic svn:mime-type detection is a must if we want this to be
efficient. Happily, our FS schema wouldn't have to change to support
different storage methods (except for the introduction of new values for
the representation types), and these changes would be completely
transparent to the client.
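
A rough sketch of how the per-type decision could look, assuming a
first-match-wins rule table keyed on svn:mime-type. The rule table, the
function name, and the particular MIME-to-method choices are illustrative
guesses, not measurements and not an existing Subversion interface.

```python
import fnmatch

STORE, COMPRESS, DELTIFY = "store", "compress", "deltify"

# Hypothetical per-repository configuration; the first matching pattern wins.
DEFAULT_RULES = [
    ("application/zip", STORE),      # already compressed
    ("application/x-gzip", STORE),   # already compressed
    ("image/*", STORE),              # most image formats are compressed
    ("application/pdf", COMPRESS),   # assumed to deltify badly
]


def choose_storage_method(mime_type, rules=DEFAULT_RULES, default=DELTIFY):
    """Pick a storage method for new file data from its MIME type.

    A file with no known type gets the default, which is what the
    repository does today (deltify).
    """
    if mime_type is None:
        return default
    for pattern, method in rules:
        if fnmatch.fnmatchcase(mime_type.lower(), pattern):
            return method
    return default
```

Because the method is chosen per call, a file whose svn:mime-type changes
later would simply be stored differently from that point on.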

Of course, a particular file's storage class might change during its
lifetime, and later on we may want to make it configurable on the
client, or at least propagate the storage info to the client (in
read-only properties?).


Two other issues we should consider:

1) Storage class (as opposed to storage method)
In some cases, users may prefer to store file contents in ordinary files
rather than in the repository. Hierarchic storage management comes to
mind, for example; it's very hard to store your 50-gig video files
off-line if they're ensconced in the repository...

2) On-the-wire representation
If a file is just "store"d in the repository because it doesn't compress
well, it's a good guess that sending deltas over the wire (or piping
through mod_deflate) isn't the most efficient thing to do. The possible
transmission methods are the same as the storage methods, except that
the choice of representation depends not just on the way it's stored in
the repository, but also on the repository-access layer, link speed,
etc. I can imagine httpd configuration that guesses link speed based on
client IP, for example.


Anyway, I don't think any of the above has to be implemented before 1.0.

-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/



Re: issue #1573: fs deltification causes delays

Posted by Jack Repenning <jr...@collab.net>.
At 10:43 PM -0600 11/3/03, kfogel@collab.net wrote:
>I'd like to discuss possible solutions to issue #1573.  From the
>issue's description:
>
>    If you add 5 bytes to a 256 meg file and commit, it takes many
>    minutes for the svn_fs_merge() to return success, because it's
>    deltifying the previous version of the file against the new
>    version.

The discussion on this point seems to be focused on the individual 
user experience.  That's an important point, I don't mean to derail 
that--and certainly "the operation fails on timeout" is a very 
important individual-experience issue!  But I'm also concerned about 
the performance impact on other users of the system: if this 
operation is so lengthy and resource-intensive, isn't it also 
clobbering the system for all other uses?  A large site needs to 
support multiple SVN users doing various things at any one time, and 
probably other stuff as well.  What are the implications of these 
ideas on total-system impact?

I riff on that a bit:

>    1. Prevent deltification on files over a certain size, but create
>       some sort of out-of-band compression command -- something like
>       'svnadmin deltify/compress/whatever' that a sysadmin or cron job
>       can run during non-peak hours to reclaim disk space.

The idea of rescheduling to off-peak hours is a good-citizen kind of 
thing.  But in these days of global development, there often aren't 
any "off-peak" hours.  What "not quite so peakish" hours can be found 
are generally over-subscribed with other admin activities already. 
And the relatively few users doing their work during whatever hours 
you choose to call "off-peak" are typically not happy with what they 
perceive as ghettoization.  So sysadmins of large sites are likely to 
be very cool to this idea.

The idea of batching the process is a good-citizen kind of thing, 
primarily because batched processes typically execute at reduced 
priority.  But this sort of arrangement is subject to catastrophic 
failure: if the backlog grows enough, then a new batch might be 
launched while an earlier one is still processing.  This is tricky to 
design for: you can't let them both begin processing the same files, 
that's wasted energy; you probably don't even want them both running 
at the same time, that's twice as much of this supposedly-unobtrusive 
processing competing with the foreground work.  Yet, you can't simply 
have the second thread quietly defer to the running thread and die, 
because this collision might actually arise from a bug in the code, 
that causes it to hang, or abort leaving deceptive droppings, or 
something along those lines.
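
One common way to handle overlapping batch runs is an atomically created
lock file whose age separates "busy" from "possibly hung". The sketch below
is a generic Python illustration under that assumption; the function names
and the three status strings are hypothetical, and none of it is part of
Subversion.

```python
import os
import time


def try_acquire(lock_path, max_age_seconds, now=None):
    """Try to take the batch lock.

    Returns "acquired" if this run may proceed, "busy" if a recent run
    holds the lock, or "stale" if the lock is older than max_age_seconds,
    which suggests a hung or crashed run that someone should look at.
    """
    now = time.time() if now is None else now
    try:
        # O_EXCL makes creation atomic: only one process can succeed.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        try:
            age = now - os.path.getmtime(lock_path)
        except FileNotFoundError:
            return "busy"  # the holder released it just now; try next run
        return "stale" if age > max_age_seconds else "busy"
    with os.fdopen(fd, "w") as lock_file:
        lock_file.write(str(os.getpid()))  # record the owner for debugging
    return "acquired"


def release(lock_path):
    """Drop the lock once the batch has finished."""
    os.remove(lock_path)
```

A second run that sees "busy" can exit quietly, while "stale" is the case
worth reporting rather than silently deferring.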

Work in this direction would need to deal with these matters.  Have 
you had any thoughts along these lines?

>
>    2. Make svn_fs_merge() spawn a deltification thread (using APR
>       threads) and return success immediately.  If the thread fails to
>       deltify, it's not the end of the world: we simply don't get the
>       disk-space savings.

This approach is, so far as I can see, completely focused on the 
individual-user problem, and wholly unhelpful for the whole-site 
problem.  While, as I say, I agree that the individual-user problem 
needs to be addressed, so does the whole-site problem.

>I assume that (1) would involve a repository config option for the
>file size.

There might be other tunable parameters as well.  For example, based 
solely on experiments like trying to zip an already-zipped file, I
suspect that deltification of certain file types is both unusually
expensive and unusually unproductive (zip files, for example).
Encrypted files are even scarier, since it's an explicit goal of
crypto that encrypting the exact same file twice must produce a
completely dissimilar ciphertext.  I posit a class of files for which
deltification is suboptimal (perhaps actually deleterious).  Who are
the deltification gurus on the list?  Has this question been 
considered?
-- 
-==-
Jack Repenning
CollabNet, Inc.
8000 Marina Boulevard, Suite 600
Brisbane, California 94005
o: 650.228.2562
c: 408.835.8090
f: 650.228.2501


Re: issue #1573: fs deltification causes delays

Posted by "C. Michael Pilato" <cm...@collab.net>.
Jack Repenning <jr...@collab.net> writes:

> At 9:01 AM -0600 11/4/03, C. Michael Pilato wrote:
> >I say all this to promote the idea that it isn't too much to ask of a
> >repos administrator to run some out-of-process deltification routine
> >-- even per-commit -- because if they are truly concerned about disk
> >space, they'll already have some out-of-process log-file cleanup
> process.  And if you have a cronjob/post-commit hook to clean up
> >logfiles, what's an extra line in that script to deltify?
> 
> I must not understand this proposal.  Expressing my ignorance and
> confusion: how can the repo admin be made responsible for
> deltification, when deltas are needed for the next update of the new
> node-revision?  If I, as an admin, choose to defer all deltification
> to the wee small hours, does that mean no one can see the new
> node-revision until tomorrow?  Surely not!  So what *is* the idea
> here?

Hm... you don't really know how deltification works, do you? :-)
You're forgiven.

All data written to the filesystem by external processes is stored
full-text in the database.  After each commit is recorded, and as a
kindness to people's hard drives (in theory, anyway), the filesystem
code enumerates the set of node-revisions changed in that commit, and
replaces the full-text contents of one or more previous versions of
each of those node-revisions with deltas against the new version of
that file.  At all times, both the new and previous versions are
accessible, and their contents are exposed via the FS API in a manner
that hides whether or not those contents are stored as fulltext or as
deltas against some other version.

In other words, if right now you skip the call to deltify_mutable() in
svn_fs_commit_txn(), your repository will behave (from a user POV)
exactly as it does today -- except it will run quite a bit faster,
create fewer log files, and bloat in the 'strings' table.
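
The scheme described above can be modelled in a few lines of Python. This is
a toy, not the libsvn_fs code: ToyFS, make_delta, and apply_delta are
invented names, it tracks a single file, and it uses difflib in place of
Subversion's real delta format. It shows only the shape of the idea: the
newest version is fulltext, its predecessor becomes a delta against it, and
read() hides the difference.

```python
from difflib import SequenceMatcher


def make_delta(source, target):
    """Describe target as copy/insert instructions against source."""
    ops = []
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))           # reuse source[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))  # literal new data
        # "delete" contributes nothing to the target
    return ops


def apply_delta(source, ops):
    """Rebuild the target text from source plus delta instructions."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(source[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


class ToyFS:
    """Versions of one file; index in self.reps is the version number."""

    def __init__(self):
        self.reps = []  # ("fulltext", text) or ("delta", base_version, ops)

    def commit(self, text, deltify=True):
        self.reps.append(("fulltext", text))
        version = len(self.reps) - 1
        if deltify and version > 0:
            self.deltify_predecessor(version)
        return version

    def deltify_predecessor(self, version):
        """Replace the previous version with a delta against this one."""
        prev = version - 1
        old_text, new_text = self.read(prev), self.read(version)
        self.reps[prev] = ("delta", version, make_delta(new_text, old_text))

    def read(self, version):
        """Return the contents, whether stored as fulltext or as a delta."""
        rep = self.reps[version]
        if rep[0] == "fulltext":
            return rep[1]
        _, base, ops = rep
        return apply_delta(self.read(base), ops)
```

Committing with deltify=False leaves every version as fulltext, yet read()
returns the same contents, which is the "behaves exactly as it does today"
point; deltify_predecessor() can then be run later by something else.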


Re: issue #1573: fs deltification causes delays

Posted by Jack Repenning <jr...@collab.net>.
At 9:01 AM -0600 11/4/03, C. Michael Pilato wrote:
>I say all this to promote the idea that it isn't too much to ask of a
>repos administrator to run some out-of-process deltification routine
>-- even per-commit -- because if they are truly concerned about disk
>space, they'll already have some out-of-process log-file cleanup
process.  And if you have a cronjob/post-commit hook to clean up
>logfiles, what's an extra line in that script to deltify?

I must not understand this proposal.  Expressing my ignorance and 
confusion: how can the repo admin be made responsible for 
deltification, when deltas are needed for the next update of the new 
node-revision?  If I, as an admin, choose to defer all deltification 
to the wee small hours, does that mean no one can see the new 
node-revision until tomorrow?  Surely not!  So what *is* the idea 
here?
-- 
-==-
Jack Repenning
CollabNet, Inc.
8000 Marina Boulevard, Suite 600
Brisbane, California 94005
o: 650.228.2562
c: 408.835.8090
f: 650.228.2501


Re: issue #1573: fs deltification causes delays

Posted by kf...@collab.net.
John Szakmeister <jo...@szakmeister.net> writes:
> If I understand how this works correctly, this 'svnadmin tunefs'
> wouldn't interfere with normal repository operation, correct?  If
> that's true then I'd say it's not a lot to ask to run this in the
> post-commit process or as a cron-job.  However, if it does
> interfere, I would -1 this.  I would have to agree with Jack in that
> this should be site-friendly and 'global development' friendly.

It wouldn't interfere -- users would never know that it was running.

(Still neutral on a solution, just providing some information to
John.)



Re: issue #1573: fs deltification causes delays

Posted by "C. Michael Pilato" <cm...@collab.net>.
Jack Repenning <jr...@collab.net> writes:

> At 4:30 PM -0600 11/4/03, C. Michael Pilato wrote:
> >
> >I know of no reason why a user would have *more* problems to
> >deal with, even if 'svnadmin tunefs' is called as frequently as during
> >a post-commit hook.
> 
> OK, the lock management strategy appears to be safe (or "correctness
> preserving under this transformation").
> 
> What about the queue management for things needing deltification, in
> case it's run less often, and the queue builds up enough that the next
> scheduled run starts before the last completes?  Manage that (I'm sure
> you can, you're such a clever boy!), and I'll feel better.

Actually, I'm not interested in solving that problem right now.  We
have a *known issue* with in-process deltification.  We can solve that
issue essentially for free -- no apparent tradeoff penalties except
for forcing repository admins to do a little extra work (which, by the
way, can also be interpreted as *empowering* admins with the ability
to tune deltification to their needs).

We have no data to back this queue concept as a necessity.  Therefore,
this late in the game, we either a) need to acquire that data from a
reliable source (so we can address a real issue instead of a ghost of
one), or b) stop wasting precious time theorizing about potential
problems and fix the ones we *know* to exist.


Re: issue #1573: fs deltification causes delays

Posted by Jack Repenning <jr...@collab.net>.
At 4:30 PM -0600 11/4/03, C. Michael Pilato wrote:
>
>I know of no reason why a user would have *more* problems to
>deal with, even if 'svnadmin tunefs' is called as frequently as during
>a post-commit hook.

OK, the lock management strategy appears to be safe (or "correctness 
preserving under this transformation").

What about the queue management for things needing deltification, in 
case it's run less often, and the queue builds up enough that the 
next scheduled run starts before the last completes?  Manage that 
(I'm sure you can, you're such a clever boy!), and I'll feel better.

At 6:24 PM -0600 11/4/03, kfogel@collab.net wrote:
>Enabling it for files over a certain size doesn't help much, because
>it turns off the feature precisely where it would help the most.

Interesting claim ("where it would help most").  I fear the 
determination of how much deltification helps in any given case is 
considerably more subtle than mere size.
-- 
-==-
Jack Repenning
CollabNet, Inc.
8000 Marina Boulevard, Suite 600
Brisbane, California 94005
o: 650.228.2562
c: 408.835.8090
f: 650.228.2501


Re: issue #1573: fs deltification causes delays

Posted by "C. Michael Pilato" <cm...@collab.net>.
John Szakmeister <jo...@szakmeister.net> writes:

> I'd hate to see anything that would push back more problems to a
> user.  I.e., an increased potential for typical repository
> operations to fail.

Me too!  I know of no reason why a user would have *more* problems to
deal with, even if 'svnadmin tunefs' is called as frequently as during
a post-commit hook.

> I wasn't sure how you were proposing this, and from some of the
> communication, I was frightened that we'd essentially have to take
> the repository offline in order to run this deltification process.

That would be called "MIScommunication".  :-)


Re: issue #1573: fs deltification causes delays

Posted by John Szakmeister <jo...@szakmeister.net>.
On Tuesday 04 November 2003 17:10, C. Michael Pilato wrote:
> John Szakmeister <jo...@szakmeister.net> writes:
> > If I understand how this works correctly, this 'svnadmin tunefs'
> > wouldn't interfere with normal repository operation, correct?  If
> > that's true then I'd say it's not a lot to ask to run this in the
> > post-commit process or as a cron-job.  However, if it does
> > interfere, I would -1 this.  I would have to agree with Jack in that
> > this should be site-friendly and 'global development' friendly.
>
> What do you mean by interfere?  Berkeley DB locking struggles?  Node
> revisions getting deltified by multiple processes at the same time?

I should've been more clear about that.  I mean 'interfere' as in it raises 
the current bar for potential conflicts with locking, deltification by 
multiple processes, and for potential errors/problems that the user has to 
deal with.  I'd hate to see anything that would push back more problems to a 
user.  I.e., an increased potential for typical repository operations to 
fail.

> The proposal is very simple.  Today we have a routine which runs over
> the items changed in a particular revision and deltifies the
> predecessors of those items.  'svnadmin tunefs' would simply relocate
> calls to that code from the commit process to an external program.
> All the ways in which multiple instances of the deltification code
> could knock heads with each other in the current scenario would still
> be present in the new one -- but not additional ones, and as noted,
> the individual user experience is greatly enhanced.

I wasn't sure how you were proposing this, and from some of the communication, 
I was frightened that we'd essentially have to take the repository offline in 
order to run this deltification process.  However, if the current set up can 
handle it, and the only difference is that we have to call 'svnadmin tunefs' 
to do the deltification, then I'm alright with that (not that my opinion 
counts for anything :-), especially if it enhances the user experience.

-John



Re: issue #1573: fs deltification causes delays

Posted by kf...@collab.net.
Mike Mason <mg...@thoughtworks.net> writes:
> Is this why I'm seeing a (fairly long) pause between my client
> finishing "transmitting file data" and my command prompt actually
> returning? The server's doing a bunch of bookkeeping to record the
> changes? I've got 8000 files under source control and regularly commit
> changes to several hundred of them (don't ask!) and so tend to notice
> it.

Yes indeed, this is probably the major cause of that.


Re: issue #1573: fs deltification causes delays

Posted by Mike Mason <mg...@thoughtworks.net>.
C. Michael Pilato wrote:

>The proposal is very simple.  Today we have a routine which runs over
>the items changed in a particular revision and deltifies the
>predecessors of those items.  'svnadmin tunefs' would simply relocate
>calls to that code from the commit process to an external program.
>All the ways in which multiple instances of the deltification code
>could knock heads with each other in the current scenario would still
>be present in the new one -- but not additional ones, and as noted,
>the individual user experience is greatly enhanced.

Is this why I'm seeing a (fairly long) pause between my client finishing
"transmitting file data" and my command prompt actually returning? The
server's doing a bunch of bookkeeping to record the changes? I've got
8000 files under source control and regularly commit changes to several
hundred of them (don't ask!) and so tend to notice it.

Cheers,
Mike.




Re: issue #1573: fs deltification causes delays

Posted by "C. Michael Pilato" <cm...@collab.net>.
John Szakmeister <jo...@szakmeister.net> writes:

> If I understand how this works correctly, this 'svnadmin tunefs'
> wouldn't interfere with normal repository operation, correct?  If
> that's true then I'd say it's not a lot to ask to run this in the
> post-commit process or as a cron-job.  However, if it does
> interfere, I would -1 this.  I would have to agree with Jack in that
> this should be site-friendly and 'global development' friendly.

What do you mean by interfere?  Berkeley DB locking struggles?  Node
revisions getting deltified by multiple processes at the same time?

The proposal is very simple.  Today we have a routine which runs over
the items changed in a particular revision and deltifies the
predecessors of those items.  'svnadmin tunefs' would simply relocate
calls to that code from the commit process to an external program.
All the ways in which multiple instances of the deltification code
could knock heads with each other in the current scenario would still
be present in the new one -- but not additional ones, and as noted,
the individual user experience is greatly enhanced.  
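
A small sketch of what "relocating the calls" means, under the assumption
that the deltification routine takes a revision and walks its changed items.
All names here (new_repo, commit, deltify_revision) are hypothetical
stand-ins, not the real FS API; the point is only that the same routine runs
whether the commit or an external program calls it.

```python
def new_repo():
    """Toy bookkeeping: which paths changed in which revisions."""
    # changed: revision -> set of paths
    # history: path -> list of revisions that touched it, ascending
    return {"changed": {}, "history": {}}


def deltify_revision(repo, revision, deltify_predecessor):
    """Deltify the predecessor of every item changed in `revision`.

    deltify_predecessor(path, old_rev, new_rev) does the real work; it is
    passed in so this sketch stays independent of any storage format.
    Returns the (path, old_rev) pairs that were processed.
    """
    done = []
    for path in sorted(repo["changed"][revision]):
        revs = repo["history"][path]
        i = revs.index(revision)
        if i > 0:  # a brand-new path has no predecessor to deltify
            deltify_predecessor(path, revs[i - 1], revision)
            done.append((path, revs[i - 1]))
    return done


def commit(repo, revision, paths, deltify=None):
    """Record a commit; deltify in-process only if a routine is given."""
    repo["changed"][revision] = set(paths)
    for path in paths:
        repo["history"].setdefault(path, []).append(revision)
    if deltify is not None:
        deltify_revision(repo, revision, deltify)
```

Passing deltify=None to commit() and calling deltify_revision() afterwards,
say from a post-commit hook or a cron job, does the same work outside the
commit path.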


Re: issue #1573: fs deltification causes delays

Posted by John Szakmeister <jo...@szakmeister.net>.
On Tuesday 04 November 2003 10:01, C. Michael Pilato wrote:
> kfogel@collab.net writes:
[snip]
>      3. Never do deltification of any sort in the filesystem code, and
>         create an out-of-band compression command that can be run as a
>         post-commit hook.
>
> > (2) looks like a wonderful solution; the only thing I'm not sure of is
> > how to do it inside an Apache module.  Does anyone know?
> >
> > I assume that (1) would involve a repository config option for the
> > file size.  Note also that we used to have an 'svnadmin deltify'
> > command and could easily get it back (see r3920), so (1) may not
> > actually be as much work as it looks like.  Those who don't want to
> > run the cron job would just set the size limit to infinity, and always
> > get deltification.
>
>   (3) looks simple, involves no repository configuration option, and
>       removes all the deltification overhead from the commit process
>       itself.  O(1) commits, finally.
>
> You have my vote.  Subversion chants the "disk is cheap" mantra all
> over the place.  If we really believe that, it won't hurt to stop
> deltifying in-process and start doing it in the hooks, even adding the
> exact command-line for running the 'svnadmin tunefs' (or whatever)
> command necessary in the post-commit.tmpl template.

If I understand how this works correctly, this 'svnadmin tunefs' wouldn't 
interfere with normal repository operation, correct?  If that's true then I'd 
say it's not a lot to ask to run this in the post-commit process or as a 
cron-job.  However, if it does interfere, I would -1 this.  I would have to 
agree with Jack in that this should be site-friendly and 'global development' 
friendly.

Cheers,
John



Re: issue #1573: fs deltification causes delays

Posted by Greg Hudson <gh...@MIT.EDU>.
On Wed, 2003-11-05 at 15:15, Greg Stein wrote:
> > You can do a forced flush.
> 
> Euh... please, no. We really don't want to get into the business of
> monkeying with what Apache decides is the proper behavior for delivery of
> data to the socket.

This is not "monkeying."  This is politely informing Apache that we
would like the client to know something, and since we're going to run
off and do something computationally intensive, we'd like the client to
know it now.  It's morally equivalent to using fflush() with stdio,
which is not "monkeying with what the C library decides is the proper
behavior for delivery of data to the file."
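
The stdio analogy can be shown with Python's buffered writer standing in for
the output layer. This only illustrates flush-before-slow-work; it is not
the Apache or neon API, and RecordingSocket and reply_then_work are made-up
names.

```python
import io


class RecordingSocket(io.RawIOBase):
    """Stand-in for a client connection: remembers what was delivered."""

    def __init__(self):
        self.delivered = b""

    def writable(self):
        return True

    def write(self, data):
        self.delivered += bytes(data)
        return len(data)


def reply_then_work(sock, work):
    """Send the reply, flush it, and only then start the slow work.

    Returns (bytes seen by the client before the flush,
             bytes seen after the flush,
             result of the slow work).
    """
    out = io.BufferedWriter(sock)
    out.write(b"commit ok\n")        # sits in the buffer, not yet delivered
    seen_before = sock.delivered
    out.flush()                      # the fflush() equivalent
    seen_after = sock.delivered
    result = work()                  # e.g. deltification
    out.detach()                     # leave the underlying socket open
    return seen_before, seen_after, result
```

Without the flush() call, the small reply would stay in the buffer for the
whole duration of work().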



Re: issue #1573: fs deltification causes delays

Posted by Branko Čibej <br...@xbc.nu>.
kfogel@collab.net wrote:

>That's all great, but it's a bit of work, and if it happened in 1.1
>that wouldn't be a disaster, because there's no compatibility step
>(and no real harm if cron jobs try to do work where there's none to be
>done, for a while after the upgrade).  And, the immediate change I'm
>proposing is just one step on the way to Brane's solution anyway.
>None of the work would need to be undone.
>
>Is this crazy, or practical?

It's probably the only thing we _can_ do for 1.0, short of dropping
deltification entirely.

>(These are not mutually exclusive, I suppose,
>
No comment :-)

-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/



Re: issue #1573: fs deltification causes delays

Posted by Greg Hudson <gh...@MIT.EDU>.
On Sun, 2003-11-09 at 10:08, Glenn A. Thompson wrote:
> No I'm advocating pulling it out of svn_fs.h.  But I'll drop it now.
> Be warned I will make another attempt to move it:-)

We're going to wind up with public library functions which are only
intended for use by specific callers.  It's just something that happens
in a complicated system.  libsvn_ra_svn has piles of stuff which is only
really intended for use by svnserve, although I suppose one could
imagine someone using it in a third-party Subversion server.

We could invent a convention for separating these quasi-public functions
into different header files, but I think that qualifies as
over-engineering.



Re: issue #1573: fs deltification causes delays

Posted by "Glenn A. Thompson" <gt...@cdr.net>.

C. Michael Pilato wrote:

>"Glenn A. Thompson" <gt...@cdr.net> writes:
>
>>Given all the possibilities you mention I question more and more, the
>>benefit to putting it back into svnadmin *if* everyone agrees that
>>Brane's "separate program" idea is the long term solution. All the work
>>is being done in the FS code anyway.  The role svnadmin plays can
>>easily be placed in another main.  See below.
>
>Oh.  Actually, my latest plan involves a new binary, 'svntunefs',
>which can use subcommands to tune things in particular ways.

cool.

>>>I know where you're heading here.  What business is it of the outside
>>>world to know a darned thing about the storage mechanism used by the
>>>database?  And I hear you, I really do.  But I see no compelling
>>>reason for this not to go in svn_fs.h.  It must be a public interface,
>>
>>Why?  The only program using it is a maintenance program.
>
>I must be missing something, because I *know* you aren't advocating
>having a program calling functions in libsvn_fs that aren't exposed
>via an API.

No I'm advocating pulling it out of svn_fs.h.  But I'll drop it now.
Be warned I will make another attempt to move it:-)





Re: issue #1573: fs deltification causes delays

Posted by "C. Michael Pilato" <cm...@collab.net>.
"Glenn A. Thompson" <gt...@cdr.net> writes:

> Given all the possibilities you mention, I question more and more the
> benefit of putting it back into svnadmin *if* everyone agrees that
> Brane's "separate program" idea is the long term solution. All the work
> is being done in the FS code anyway.  The role svnadmin plays can
> easily be placed in another main.  See below.

Oh.  Actually, my latest plan involves a new binary, 'svntunefs',
which can use subcommands to tune things in particular ways.

> >I know where you're heading here.  What business is it of the outside
> >world to know a darned thing about the storage mechanism used by the
> >database?  And I hear you, I really do.  But I see no compelling
> >reason for this not to go in svn_fs.h.  It must be a public interface,
> >
> Why?  The only program using it is a maintenance program.

I must be missing something, because I *know* you aren't advocating
having a program calling functions in libsvn_fs that aren't exposed
via an API.

> >single function.  Would you feel better if it were called
> >svn_fs_deltify_berkeley()?
> >
> Absolutely not! :-)
> Since I seem to be on the losing side of the argument for keeping it
> out of svn_fs.h, how about something like svn_fs_optimize,
> svn_fs_vacuum, or svn_fs_tune?  gat

See above.  svn_fs_tunefs() would likely be the route I'd take.

---------------------------------------------------------------------

Re: issue #1573: fs deltification causes delays

Posted by Ben Collins-Sussman <su...@collab.net>.
On Sat, 2003-11-08 at 09:03, Glenn A. Thompson wrote:

> Since I seem to be on the losing side of the argument for keeping it
> out of svn_fs.h, how about something like svn_fs_optimize,
> svn_fs_vacuum, or svn_fs_tune?

<bikeshed>

* "optimize"?  no way.  optimization usually means "faster for the
user".  If anything, we're slowing down the speed at which users can
retrieve data!

* "vacuum"?  that sounds like we're cleaning up unwanted garbage.  but
we're certainly not losing any data.  this name would be more
appropriate for a command that tosses unused db logs.

* "tune"?  eh.  to me, this implies a whole lot of flexibility... as if
we're giving a bunch of options to the admin to finely-adjust the
behavior of the repository.

* "deltify"?  I've never liked this name, because it's too svn-jargony. 
nobody knows what it means unless you start explaining what 'delta' or
'vdelta' means.

My vote is for "compress".  It's a universally recognized computer term,
requires no explanation, and describes *exactly* what's going on.

</bikeshed>


---------------------------------------------------------------------

Re: issue #1573: fs deltification causes delays

Posted by kf...@collab.net.
"Glenn A. Thompson" <gt...@cdr.net> writes:
> Since I seem to be on the losing side of the argument for keeping it
> out of svn_fs.h, how about something like svn_fs_optimize,
> svn_fs_vacuum, or svn_fs_tune?  gat

Oy vey :-).

Call it svn_fs_deltify, document that it may become obsolete someday,
and that we recommend avoiding it if possible.  And explain why.

We'd hardly be the first API to contain such caveats.

-Karl

---------------------------------------------------------------------

Re: issue #1573: fs deltification causes delays

Posted by "Glenn A. Thompson" <gt...@cdr.net>.
...

>So many options.  So potentially nasty an interface.  
>
>But I'm working on it.  
>
>The gears are turning.  
>
>I can hear the ... uh-oh!  That's a grindstone that I'm lying on!
>  
>

Given all the possibilities you mention, I question more and more the
benefit of putting it back into svnadmin *if* everyone agrees that
Brane's "separate program" idea is the long term solution. All the work
is being done in the FS code anyway.  The role svnadmin plays can easily 
be placed in another main.  See below.

>  
>
>>Ignoring the unsupported issue, I like what you propose. One thing
>>though, IMHO svn_fs_deltify should not be part of the *core*
>>svn_fs.h API.  Would you consider *not* putting svn_fs_deltify back
>>in to svn_fs.h?  I'd prefer seeing it in some sort of maintenance
>>header file.
>>    
>>
>
>I know where you're heading here.  What business is it of the outside
>world to know a darned thing about the storage mechanism used by the
>database?  And I hear you, I really do.  But I see no compelling
>reason for this not to go in svn_fs.h.  It must be a public interface,
>
Why?  The only program using it is a maintenance program.

>and there's no reason to introduce a new public interface file for a
>single function.  Would you feel better if it were called
>svn_fs_deltify_berkeley()?
>
Absolutely not! :-)
Since I seem to be on the losing side of the argument for keeping it
out of svn_fs.h, how about something like svn_fs_optimize,
svn_fs_vacuum, or svn_fs_tune?

gat






---------------------------------------------------------------------

Re: issue #1573: fs deltification causes delays

Posted by Branko Čibej <br...@xbc.nu>.
C. Michael Pilato wrote:

>"Glenn A. Thompson" <gt...@cdr.net> writes:
>  
>
>>Ignoring the unsupported issue, I like what you propose. One thing
>>though, IMHO svn_fs_deltify should not be part of the *core*
>>svn_fs.h API.  Would you consider *not* putting svn_fs_deltify back
>>in to svn_fs.h?  I'd prefer seeing it in some sort of maintenance
>>header file.
>>    
>>
>
>I know where you're heading here.  What business is it of the outside
>world to know a darned thing about the storage mechanism used by the
>database?  And I hear you, I really do.  But I see no compelling
>reason for this not to go in svn_fs.h.  It must be a public interface,
>and there's no reason to introduce a new public interface file for a
>single function.  Would you feel better if it were called
>svn_fs_deltify_berkeley()?
>  
>
Deltification has nothing to do with the FS back-end, either. What fun,
eh? :-)


-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/


---------------------------------------------------------------------

Re: issue #1573: fs deltification causes delays

Posted by "C. Michael Pilato" <cm...@collab.net>.
"Glenn A. Thompson" <gt...@cdr.net> writes:

> Hey,
> 
> >I hope these two points will make the following proposal less
> >controversial than it might otherwise be:
> >
> >Let's turn off fs deltification by default, restore the command
> >'svnadmin deltify' from r3920 or thereabouts, and document (both in
> >user documentation and in the post-commit template) how to run both
> >manual deltification and logfile cleanup, with a cron example as well
> >as a hook example.
> >
> I remember this being un-implemented.  I did a quick check and
> svnadmin calls svn_fs_deltify which returns
> SVN_ERR_UNSUPPORTED_FEATURE in every revision of deltify.c I looked
> at. Is there some simple "calling of" an existing function that
> resolves this problem? Did I miss something?

No, you didn't miss anything.  It was unsupported for what was at the
time a decent reason.  Back then, we only ever deltified a node
against its successor.  Of course, we can't *know* a node's successor
with confidence without a really expensive bit of filesystem
scouring.  So, I punted.

Moving deltification out of the commit process does introduce a bit of
theoretical ick, similar to what I ran into with svn_fs_deltify() back
then.  Say my repository generates 10 new revisions per day, and I
have a script that runs nightly to deltify stuff.  Well, what should I
deltify, and against what?  I could loop over the 10 revisions and
deltify the predecessors of nodes changed in that revision against the
nodes themselves -- that'd be the easiest thing, and would be the
exact same functionality provided today.  Or I could just deltify the
nodes that changed in the revision against the empty string (in other
words, just do compression).  Or maybe both.  Or I could look for the
paths in HEAD (verifying that if I find an existent thing there, it is
ancestrally related to my node) and deltify against that, falling back
to deltifying against an empty string if I can't find the youngest
successor.
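
The last option above (deltify against the youngest successor found in
HEAD, falling back to the empty string) could be sketched roughly as
follows.  This is only an illustration in Python, not Subversion code:
the Node type, its line_of_history field, and pick_delta_base() are
hypothetical names, and a plain dict stands in for the HEAD tree.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Node:
    path: str
    revision: int
    # Hypothetical stand-in for "ancestrally related": nodes on the same
    # line of history share this id.
    line_of_history: int


def pick_delta_base(node: Node, head: Dict[str, Node]) -> Optional[Node]:
    """Return the node to deltify against, or None for the empty string."""
    candidate = head.get(node.path)
    if candidate is None:
        return None  # the path no longer exists in HEAD
    if candidate.line_of_history != node.line_of_history:
        return None  # same path, but an unrelated node
    if candidate.revision <= node.revision:
        return None  # node is already the youngest on its line
    return candidate


# Toy HEAD tree: one file whose youngest version was created in r10.
head = {"trunk/big.bin": Node("trunk/big.bin", 10, 7)}
base = pick_delta_base(Node("trunk/big.bin", 4, 7), head)
print(base.revision if base else "empty string")
```

Returning None for every fallback case keeps the caller simple: it
either deltifies against a real node or just compresses.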

So many options.  So potentially nasty an interface.  

But I'm working on it.  

The gears are turning.  

I can hear the ... uh-oh!  That's a grindstone that I'm lying on!

> Ignoring the unsupported issue, I like what you propose. One thing
> though, IMHO svn_fs_deltify should not be part of the *core*
> svn_fs.h API.  Would you consider *not* putting svn_fs_deltify back
> in to svn_fs.h?  I'd prefer seeing it in some sort of maintenance
> header file.

I know where you're heading here.  What business is it of the outside
world to know a darned thing about the storage mechanism used by the
database?  And I hear you, I really do.  But I see no compelling
reason for this not to go in svn_fs.h.  It must be a public interface,
and there's no reason to introduce a new public interface file for a
single function.  Would you feel better if it were called
svn_fs_deltify_berkeley()?

---------------------------------------------------------------------

Re: issue #1573: fs deltification causes delays

Posted by "Glenn A. Thompson" <gt...@cdr.net>.
Hey,

>I hope these two points will make the following proposal less
>controversial than it might otherwise be:
>
>Let's turn off fs deltification by default, restore the command
>'svnadmin deltify' from r3920 or thereabouts, and document (both in
>user documentation and in the post-commit template) how to run both
>manual deltification and logfile cleanup, with a cron example as well
>as a hook example.
>
I remember this being un-implemented.  I did a quick check and svnadmin
calls svn_fs_deltify which returns SVN_ERR_UNSUPPORTED_FEATURE in every
revision of deltify.c I looked at. Is there some simple "calling of" an
existing function that resolves this problem? Did I miss something?

Ignoring the unsupported issue, I like what you propose. 
One thing though, IMHO svn_fs_deltify should not be part of the *core* 
svn_fs.h API.
Would you consider *not* putting svn_fs_deltify back in to svn_fs.h?
I'd prefer seeing it in some sort of maintenance header file.

thanks,
gat


---------------------------------------------------------------------

Re: issue #1573: fs deltification causes delays

Posted by kf...@collab.net.
Two things seem pretty clear from this discussion:

   1. The status quo is useless, because the space savings of
      deltification are undone by the additional BDB logfiles.
      (Unless you run log cleanup on a regular basis, of course.)

   2. The status quo is harmful, because it causes users to wait
      longer for commits to return, especially with large files.

I hope these two points will make the following proposal less
controversial than it might otherwise be:

Let's turn off fs deltification by default, restore the command
'svnadmin deltify' from r3920 or thereabouts, and document (both in
user documentation and in the post-commit template) how to run both
manual deltification and logfile cleanup, with a cron example as well
as a hook example.

I am *not* proposing this as an ideal solution, merely one better than
what we have now.  Right now, we have deltification, but it doesn't
help unless the admin takes extra steps.  If they have to take extra
steps anyway, we can make the deltification be part of that and thus
avoid the user-wait problem.

For the long run, I think the best solution comes from Brane:

> Why not? Anyway, we don't have to put this into "svnadmin", we can
> create a custom binary (e.g., svndeltify) and simply demand that it _is_
> installed in a place where Apache, svnserve, etc. can find it. It's no
> more complicated than installing mod_dav_svn.so, and it's a _lot_ easier
> than making default hook scripts or cron scripts that work on all platforms.

In other words, Subversion itself would still drive deltification,
unconditionally, but we would use portable APR methods to fire it off
as a background process (maybe APR can even 'nice' it on some
platforms?) so the commit could finish right away.
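
The "fire it off as a background process" idea can be sketched in a few
lines.  This is a Python illustration rather than the APR code this
thread has in mind; spawn_background() is a hypothetical helper, and
the detaching flag shown only has an effect on POSIX systems.

```python
import subprocess
from typing import List


def spawn_background(argv: List[str]) -> subprocess.Popen:
    """Start argv as a detached child and return at once, without waiting."""
    return subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # POSIX only: leave the caller's session
    )
```

A commit path would call something like
spawn_background(["svndeltify", repos_path, str(revision)]) and then
return.  'svndeltify' is the example name from Brane's mail; its
arguments here are invented.  If the child fails, the only loss is the
disk-space savings.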

That's all great, but it's a bit of work, and if it happened in 1.1
that wouldn't be a disaster, because there's no compatibility step
(and no real harm if cron jobs try to do work where there's none to be
done, for a while after the upgrade).  And, the immediate change I'm
proposing is just one step on the way to Brane's solution anyway.
None of the work would need to be undone.

Is this crazy, or practical?

(These are not mutually exclusive, I suppose, but FWIW, Mike and Ben
also thought the latter.)

-Karl

---------------------------------------------------------------------

Re: issue #1573: fs deltification causes delays

Posted by Branko Čibej <br...@xbc.nu>.
C. Michael Pilato wrote:

>Branko Čibej <br...@xbc.nu> writes:
>
>  
>
>>I agree. But I don't think we should put that in a post-commit hook,
>>even a "default" post-commit hook; the hook scripts themselves are
>>platform-specific. Instead, the FS code should run "svnadmin deltify" or
>>whatever directly.
>>    
>>
>
>And where/how should it do that?  Remember, we have to background this
>process, else what's the point?  And we can't bank on 'svnadmin'
>actually being named 'svnadmin', or being in, say, Apache's PATH, or...
>  
>
Why not? Anyway, we don't have to put this into "svnadmin", we can
create a custom binary (e.g., svndeltify) and simply demand that it _is_
installed in a place where Apache, svnserve, etc. can find it. It's no
more complicated than installing mod_dav_svn.so, and it's a _lot_ easier
than making default hook scripts or cron scripts that work on all platforms.


-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/


---------------------------------------------------------------------

Re: issue #1573: fs deltification causes delays

Posted by "C. Michael Pilato" <cm...@collab.net>.
Branko Čibej <br...@xbc.nu> writes:

> I agree. But I don't think we should put that in a post-commit hook,
> even a "default" post-commit hook; the hook scripts themselves are
> platform-specific. Instead, the FS code should run "svnadmin deltify" or
> whatever directly.

And where/how should it do that?  Remember, we have to background this
process, else what's the point?  And we can't bank on 'svnadmin'
actually being named 'svnadmin', or being in, say, Apache's PATH, or...

---------------------------------------------------------------------


Re: issue #1573: fs deltification causes delays

Posted by Branko Čibej <br...@xbc.nu>.
Sander Striker wrote:

>I don't really see a clean option here other than doing deltification in
>a separate process.
>  
>
I agree. But I don't think we should put that in a post-commit hook,
even a "default" post-commit hook; the hook scripts themselves are
platform-specific. Instead, the FS code should run "svnadmin deltify" or
whatever directly.


-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/


---------------------------------------------------------------------

RE: issue #1573: fs deltification causes delays

Posted by Sander Striker <st...@apache.org>.
> From: Greg Stein [mailto:gstein@lyra.org]
> Sent: Wednesday, November 05, 2003 9:16 PM

[...]
> > You can do a forced flush.
> 
> Euh... please, no. We really don't want to get into the business of
> monkeying with what Apache decides is the proper behavior for delivery of
> data to the socket.

I didn't say you should.  I said you can.
 
> As Justin pointed out, putting a cleanup on the request pool might work out.
> However, that *does* imply that other request cleanups could be delayed,
> and it also delays a return to Apache to deal with the connection object.
> In particular, we want to return control to Apache so that it can process
> the DELETE request which directly follows the MERGE request.

Not to mention that you shouldn't do allocations, subpool creation, etc.
inside a pool cleanup.

I don't really see a clean option here other than doing deltification in
a separate process.


Sander



---------------------------------------------------------------------

Re: issue #1573: fs deltification causes delays

Posted by Greg Stein <gs...@lyra.org>.
On Wed, Nov 05, 2003 at 07:18:30PM +0100, Sander Striker wrote:
> > From: cmpilato@localhost.localdomain
> > [mailto:cmpilato@localhost.localdomain]On Behalf Of C. Michael Pilato
> > Sent: Wednesday, November 05, 2003 7:09 PM
> 
> > > What if we were to do the deltification on the same thread that serviced
> > > the request, but after the response has been written to the socket?
> > 
> > This was my original idea, but there's an API problem.  I don't
> > believe that Apache guarantees that things we write to our output
> > stream are flushed to the socket immediately.  The *right* way to do
> > this would be for Apache/mod_dav to provide a post-request cleanup
> > hook, which is guaranteed to run only after the request response has
> > been sent down the wire.  In that cleanup, we could do the
> > deltification.
> > 
> > But if Apache doesn't have this functionality already, we don't have
> > time to wait for another release of that code.
> 
> You can do a forced flush.

Euh... please, no. We really don't want to get into the business of
monkeying with what Apache decides is the proper behavior for delivery of
data to the socket.

As Justin pointed out, putting a cleanup on the request pool might work out.
However, that *does* imply that other request cleanups could be delayed,
and it also delays a return to Apache to deal with the connection object.
In particular, we want to return control to Apache so that it can process
the DELETE request which directly follows the MERGE request.

The cleanup really ought to be on the connection object so that we don't
slow down request processing. But even that is a bit shaky, as we don't
know the relative ordering of the deltification against the closure of the
socket. It would suck to see Apache decide to close the connection, but a
client doesn't know that is happening until several minutes later.
Meanwhile, they're hung up on sending another set of requests over the
connection, only to be delayed on knowing they have to close and reopen a
connection.

IMO, I like the idea of a fully out-of-band mechanism in the post-commit.
Alternatively, I would also like to see a "standard" script that does the
deltification and the log file pruning, and is intended to be run by cron
periodically. I would also suggest this script can be configured to handle
N repositories on the box, so that an admin can just tweak a .conf file
rather than setting up Yet Another Cron Entry.
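
A rough sketch of such a cron-driven script, in Python, might look like
this.  Everything here is an assumption for illustration: the .conf
format (one repository path per line, '#' comments), the helper names,
and the two svnadmin subcommands ('deltify' is the command proposed in
this thread; 'list-unused-dblogs' is assumed to be the log-listing
subcommand in your build).  The sketch only plans the commands; it
does not run them.

```python
from typing import List


def read_repos(conf_text: str) -> List[str]:
    """Parse one repository path per line; skip blanks and '#' comments."""
    repos = []
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            repos.append(line)
    return repos


def maintenance_commands(repos: List[str]) -> List[List[str]]:
    """Build the per-repository command lines for a nightly run."""
    commands = []
    for repo in repos:
        # Assumed subcommand names; adjust to whatever the tools provide.
        commands.append(["svnadmin", "deltify", repo])
        commands.append(["svnadmin", "list-unused-dblogs", repo])
    return commands


conf = "/var/svn/project-a\n/var/svn/project-b  # second repository\n"
for argv in maintenance_commands(read_repos(conf)):
    print(" ".join(argv))
```

Adding a repository then means adding one line to the .conf file; the
cron entry itself never changes.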

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

---------------------------------------------------------------------

Re: issue #1573: fs deltification causes delays

Posted by Justin Erenkrantz <ju...@erenkrantz.com>.
--On Wednesday, November 05, 2003 12:09:16 -0600 "C. Michael Pilato" 
<cm...@collab.net> wrote:

> This was my original idea, but there's an API problem.  I don't
> believe that Apache guarantees that things we write to our output
> stream are flushed to the socket immediately.  The *right* way to do
> this would be for Apache/mod_dav to provide a post-request cleanup
> hook, which is guaranteed to run only after the request response has
> been sent down the wire.  In that cleanup, we could do the
> deltification.

Smells like a pool cleanup to me.  Register against the request_rec->pool 
(or some other pool).  You might get delayed until connection close, but 
that's not a super-big problem.  -- justin

---------------------------------------------------------------------

RE: issue #1573: fs deltification causes delays

Posted by Sander Striker <st...@apache.org>.
> From: cmpilato@localhost.localdomain
> [mailto:cmpilato@localhost.localdomain]On Behalf Of C. Michael Pilato
> Sent: Wednesday, November 05, 2003 7:09 PM

> > What if we were to do the deltification on the same thread that serviced
> > the request, but after the response has been written to the socket?
> 
> This was my original idea, but there's an API problem.  I don't
> believe that Apache guarantees that things we write to our output
> stream are flushed to the socket immediately.  The *right* way to do
> this would be for Apache/mod_dav to provide a post-request cleanup
> hook, which is guaranteed to run only after the request response has
> been sent down the wire.  In that cleanup, we could do the
> deltification.
> 
> But if Apache doesn't have this functionality already, we don't have
> time to wait for another release of that code.

You can do a forced flush.

Sander

---------------------------------------------------------------------

Re: issue #1573: fs deltification causes delays

Posted by "C. Michael Pilato" <cm...@collab.net>.
mark benedetto king <mb...@lowlatency.com> writes:

> On Wed, Nov 05, 2003 at 09:52:52AM +0100, Sander Striker wrote:
> > 
> > > >>    2. Make svn_fs_merge() spawn a deltification thread (using APR
> > > >>       threads) and return success immediately.  If the thread fails to
> > > >>       deltify, it's not the end of the world: we simply don't get the
> > > >>       disk-space savings.
> > 
> > For 2. we'd need both a fork()ed and a threaded implementation, based
> > on availability advertised by APR at compile time.
> > 
> 
> What if we were to do the deltification on the same thread that serviced
> the request, but after the response has been written to the socket?

This was my original idea, but there's an API problem.  I don't
believe that Apache guarantees that things we write to our output
stream are flushed to the socket immediately.  The *right* way to do
this would be for Apache/mod_dav to provide a post-request cleanup
hook, which is guaranteed to run only after the request response has
been sent down the wire.  In that cleanup, we could do the
deltification.

But if Apache doesn't have this functionality already, we don't have
time to wait for another release of that code.

---------------------------------------------------------------------

Re: issue #1573: fs deltification causes delays

Posted by mark benedetto king <mb...@lowlatency.com>.
On Wed, Nov 05, 2003 at 09:52:52AM +0100, Sander Striker wrote:
> 
> > >>    2. Make svn_fs_merge() spawn a deltification thread (using APR
> > >>       threads) and return success immediately.  If the thread fails to
> > >>       deltify, it's not the end of the world: we simply don't get the
> > >>       disk-space savings.
> 
> For 2. we'd need both a fork()ed and a threaded implementation, based
> on availability advertised by APR at compile time.
> 

What if we were to do the deltification on the same thread that serviced
the request, but after the response has been written to the socket?

--ben


---------------------------------------------------------------------

Re: issue #1573: fs deltification causes delays

Posted by Garrett Rooney <ro...@electricjellyfish.net>.
On Nov 4, 2003, at 7:24 PM, kfogel@collab.net wrote:

> Garrett Rooney <ro...@electricjellyfish.net> writes:
>> I really don't find the idea of having to run separate commands to get
>> deltification very intuitive.  If I was a new user, I would be shocked
>> if deltas were not used by default.  If anything, this should only be
>> enabled for files over a certain size, so that everyone does not need
>> to pay the price.
>
> Some devil's advocate questions/comments:
>
> How would a new user ever find out, let alone be shocked?

Reading it in the docs maybe?  Noticing this 'svnadmin deltify' command 
or whatever we call it?  It just seems odd to be requiring people to do 
extra work to get the space optimization of using deltas instead of 
storing fulltext.

> Enabling it for files over a certain size doesn't help much, because
> it turns off the feature precisely where it would help the most.
> Making it a boolean config option (either deltify always, or not at
> all) might be a better way.

Perhaps, but then you're requiring a config change if you want to 
commit a change to a particularly huge file...

> If Mike's results are representative, the logfile penalty we're paying
> for deltification outweighs the space savings.

Yeah, but log files get created for more things than deltification, and 
will always need to be cleaned up.  Plus, it just seems more natural to 
have to delete db logfiles eventually than it does to have to run extra 
commands to deltify your repository.  Just my opinion, I could be 
wrong.

-garrett


---------------------------------------------------------------------

Re: issue #1573: fs deltification causes delays

Posted by kf...@collab.net.
Garrett Rooney <ro...@electricjellyfish.net> writes:
> I really don't find the idea of having to run separate commands to get
> deltification very intuitive.  If I was a new user, I would be shocked
> if deltas were not used by default.  If anything, this should only be
> enabled for files over a certain size, so that everyone does not need
> to pay the price.

Some devil's advocate questions/comments:

How would a new user ever find out, let alone be shocked?

Enabling it for files over a certain size doesn't help much, because
it turns off the feature precisely where it would help the most.
Making it a boolean config option (either deltify always, or not at
all) might be a better way.

If Mike's results are representative, the logfile penalty we're paying
for deltification outweighs the space savings.

---------------------------------------------------------------------

RE: issue #1573: fs deltification causes delays

Posted by Sander Striker <st...@apache.org>.
> From: Garrett Rooney [mailto:rooneg@electricjellyfish.net]
> Sent: Wednesday, November 05, 2003 1:07 AM

> On Nov 4, 2003, at 10:01 AM, C. Michael Pilato wrote:
> 
> > kfogel@collab.net writes:
> >
> >> There are various proposed solutions in the issue.  But for now, I'd
> >> like to talk just about solutions we can implement before 1.0 (i.e.,
> >> before Beta, i.e., before 0.33 :-) ).  The two that seem most
> >> realistic are:
> >>
> >>    1. Prevent deltification on files over a certain size, but create
> >>       some sort of out-of-band compression command -- something like
> >>       'svnadmin deltify/compress/whatever' that a sysadmin or cron job
> >>       can run during non-peak hours to reclaim disk space.

I'd rather go for all or nothing.  Not something based on file size.

> >>    2. Make svn_fs_merge() spawn a deltification thread (using APR
> >>       threads) and return success immediately.  If the thread fails to
> >>       deltify, it's not the end of the world: we simply don't get the
> >>       disk-space savings.

For 2. we'd need both a fork()ed and a threaded implementation, based
on availability advertised by APR at compile time.

> >      3. Never do deltification of any sort in the filesystem code, and
> >         create an out-of-band compression command that can be run as a
> >         post-commit hook.
> 
> I really don't find the idea of having to run separate commands to get 
> deltification very intuitive.  If I was a new user, I would be shocked 
> if deltas were not used by default.  If anything, this should only be 
> enabled for files over a certain size, so that everyone does not need 
> to pay the price.

Well, I could go for this, but we'd need a default working post-commit
hook that does this, instead of just a template.
 
> > I say all this to promote the idea that it isn't too much to ask of a
> > repos administrator to run some out-of-process deltification routine
> > -- even per-commit -- because if they are truly concerned about disk
> > space, they'll already have some out-of-process log-file cleanup
> > process.  And if you have a cronjob/post-commit hook to cleanup
> > logfiles, what's an extra line in that script to deltify?
> 
> People already freak when they find out about the logfiles stuff.  I 
> don't like the idea of adding more administrative overhead if it can be 
> avoided.

I think a working default post-commit hook, which does:
 a) deltification
 b) log cleanup

would be a fair solution.


Sander

---------------------------------------------------------------------

Re: issue #1573: fs deltification causes delays

Posted by Garrett Rooney <ro...@electricjellyfish.net>.
On Nov 4, 2003, at 10:01 AM, C. Michael Pilato wrote:

> kfogel@collab.net writes:
>
>> There are various proposed solutions in the issue.  But for now, I'd
>> like to talk just about solutions we can implement before 1.0 (i.e.,
>> before Beta, i.e., before 0.33 :-) ).  The two that seem most
>> realistic are:
>>
>>    1. Prevent deltification on files over a certain size, but create
>>       some sort of out-of-band compression command -- something like
>>       'svnadmin deltify/compress/whatever' that a sysadmin or cron job
>>       can run during non-peak hours to reclaim disk space.
>>
>>    2. Make svn_fs_merge() spawn a deltification thread (using APR
>>       threads) and return success immediately.  If the thread fails to
>>       deltify, it's not the end of the world: we simply don't get the
>>       disk-space savings.
>
>      3. Never do deltification of any sort in the filesystem code, and
>         create an out-of-band compression command that can be run as a
>         post-commit hook.

I really don't find the idea of having to run separate commands to get 
deltification very intuitive.  If I was a new user, I would be shocked 
if deltas were not used by default.  If anything, this should only be 
enabled for files over a certain size, so that everyone does not need 
to pay the price.

> I say all this to promote the idea that it isn't too much to ask of a
> repos administrator to run some out-of-process deltification routine
> -- even per-commit -- because if they are truly concerned about disk
> space, they'll already have some out-of-process log-file cleanup
> process.  And if you have a cronjob/post-commit hook to clean up
> logfiles, what's an extra line in that script to deltify?

People already freak when they find out about the logfiles stuff.  I 
don't like the idea of adding more administrative overhead if it can be 
avoided.

-garrett



Re: issue #1573: fs deltification causes delays

Posted by pm...@users.sourceforge.net.
> > So we'll get a (data-based) list of (crc, hash, start, length) blocks,
> > which we then compare against the "new" file.
> > In my upcoming perl-module "Digest::Manber" I take another value as well
> > - the crc prior to the boundary.
>
> So, you're proposing that we store block checksums in each
> representation, so that when we diff two representations against each
> other, you can skip past the windows which have matching checksums.
>
> If we stored plaintexts as deltas against the empty source, we could do
> this with (drumroll) window checksums.  Which CMike just eliminated from
> the schema.  Cue Alanis Morissette song here.

Well, the windows would have to be determined by the rolling CRC anyway, so 
that they re-synchronize after inserted characters ... checksums of every X 
kB wouldn't work.
And I'd eventually store checksums only for the plain-text files, i.e. the last 
revision - but now that you mention it, they don't exist anymore with 
delta-against-empty ... I remember having read this on the list.
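
To illustrate why checksums of every X kB lose synchronization, here is a
small Python sketch.  It is not Subversion code; the function names and
the choice of MD5 are assumptions made for the example.

```python
import hashlib


def fixed_block_hashes(data, block_size):
    """Checksum every block_size bytes, starting at offset 0."""
    return [hashlib.md5(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]


def matching_blocks(old, new, block_size):
    """Count block positions whose checksums agree in both versions."""
    old_hashes = fixed_block_hashes(old, block_size)
    new_hashes = fixed_block_hashes(new, block_size)
    return sum(1 for a, b in zip(old_hashes, new_hashes) if a == b)
```

Appending bytes to the file leaves every earlier block checksum intact,
but inserting a single byte at the front shifts all later data across the
block boundaries, so for typical data no checksum matches any more.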

But it seems that this is a moot point ...


Regards,

P.Marek



Re: issue #1573: fs deltification causes delays

Posted by Greg Hudson <gh...@MIT.EDU>.
On Thu, 2003-11-06 at 02:34, pmarek@users.sourceforge.net wrote:
> So we'll get a (data-based) list of (crc, hash, start, length) blocks, which 
> we then compare against the "new" file.
> In my upcoming perl-module "Digest::Manber" I take another value as well - the 
> crc prior to the boundary.

So, you're proposing that we store block checksums in each
representation, so that when we diff two representations against each
other, you can skip past the windows which have matching checksums.

If we stored plaintexts as deltas against the empty source, we could do
this with (drumroll) window checksums.  Which CMike just eliminated from
the schema.  Cue Alanis Morissette song here.



Re: issue #1573: fs deltification causes delays

Posted by pm...@users.sourceforge.net.
I have to say first that I'm not really familiar with the way svn handles 
deltification now - so please be patient and just tell me when I'm saying stupid 
things. :-)

> > There are various proposed solutions in the issue.  But for now, I'd
> > like to talk just about solutions we can implement before 1.0 (i.e.,
> > before Beta, i.e., before 0.33 :-) ).  The two that seem most
> > realistic are:
> >
> >    1. Prevent deltification on files over a certain size, but create
> >       some sort of out-of-band compression command -- something like
> >       'svnadmin deltify/compress/whatever' that a sysadmin or cron job
> >       can run during non-peak hours to reclaim disk space.
> >
> >    2. Make svn_fs_merge() spawn a deltification thread (using APR
> >       threads) and return success immediately.  If the thread fails to
> >       deltify, it's not the end of the world: we simply don't get the
> >       disk-space savings.
>
>      3. Never do deltification of any sort in the filesystem code, and
>         create an out-of-band compression command that can be run as a
>         post-commit hook.

Another solution, which may not be done by 0.33, would be the following:

If we trust that there will be no hash collisions (in SHA or MD5 or whatever - 
which may not hold true [1]), then we can just save the hashes of blocks of data.
The block boundaries are determined by a rolling CRC (see also [2]): a 
boundary is wherever e.g. the last 14 bits of the CRC are zero.

So we'll get a (data-based) list of (crc, hash, start, length) blocks, which 
we then compare against the "new" file.
In my upcoming perl-module "Digest::Manber" I take another value as well - the 
crc prior to the boundary.
So we would have e.g. a 128-bit hash, a 32-bit CRC, and length information to 
compare for each block, which should make synchronisation faster - we don't have 
to compare two full files against each other, but can work from a list (probably 
sorted by hash).

I don't exactly know what is implemented today - but maybe that would make 
deltification faster (at the expense of hard disk space, of course).
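
The block scheme above can be sketched in Python as follows.  This is an
illustration under stated assumptions, not Digest::Manber and not
Subversion code: a simple polynomial rolling checksum stands in for the
rolling CRC, the 48-byte window is an arbitrary choice, MD5 supplies the
128-bit hash, and all names are hypothetical.

```python
import hashlib

BASE = 257       # multiplier of the rolling checksum (assumed)
MOD = 1 << 32    # keep the checksum within 32 bits


def split_blocks(data, bits=14, window=48):
    """Return (crc, hash, start, length) tuples with data-based boundaries."""
    mask = (1 << bits) - 1
    top = pow(BASE, window - 1, MOD)   # weight of the byte leaving the window
    blocks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        if i >= window:
            # Drop the oldest byte so only the last `window` bytes count.
            rolling = (rolling - data[i - window] * top) % MOD
        rolling = (rolling * BASE + byte) % MOD
        # A boundary is wherever the low `bits` bits of the checksum are zero.
        if i + 1 >= window and (rolling & mask) == 0:
            digest = hashlib.md5(data[start:i + 1]).digest()
            blocks.append((rolling, digest, start, i + 1 - start))
            start = i + 1
    if start < len(data):
        digest = hashlib.md5(data[start:]).digest()
        blocks.append((rolling, digest, start, len(data) - start))
    return blocks


def shared_blocks(old_blocks, new_blocks):
    """Blocks of the new file whose (hash, length) also occur in the old one."""
    known = {(digest, length) for _, digest, _, length in old_blocks}
    return [b for b in new_blocks if (b[1], b[3]) in known]
```

Because a boundary depends only on the last few bytes before it, bytes
inserted near the start of a file change only the first block or so; every
later block keeps its hash and is found again by shared_blocks() without
comparing the two files byte by byte.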

 
[1]: "An analysis of compare-by-hash" http://www.nmt.edu/~val/review/hash.pdf
[2]: "Finding Similar Files in a Large File System" 
http://citeseer.nj.nec.com/manber94finding.html



Re: issue #1573: fs deltification causes delays

Posted by "C. Michael Pilato" <cm...@collab.net>.
kfogel@collab.net writes:

> There are various proposed solutions in the issue.  But for now, I'd
> like to talk just about solutions we can implement before 1.0 (i.e.,
> before Beta, i.e., before 0.33 :-) ).  The two that seem most
> realistic are:
> 
>    1. Prevent deltification on files over a certain size, but create
>       some sort of out-of-band compression command -- something like
>       'svnadmin deltify/compress/whatever' that a sysadmin or cron job
>       can run during non-peak hours to reclaim disk space.
> 
>    2. Make svn_fs_merge() spawn a deltification thread (using APR
>       threads) and return success immediately.  If the thread fails to
>       deltify, it's not the end of the world: we simply don't get the
>       disk-space savings.

     3. Never do deltification of any sort in the filesystem code, and
        create an out-of-band compression command that can be run as a
        post-commit hook.

> (2) looks like a wonderful solution; the only thing I'm not sure of is
> how to do it inside an Apache module.  Does anyone know?
> 
> I assume that (1) would involve a repository config option for the
> file size.  Note also that we used to have an 'svnadmin deltify'
> command and could easily get it back (see r3920), so (1) may not
> actually be as much work as it looks like.  Those who don't want to
> run the cron job would just set the size limit to infinity, and always
> get deltification.

  (3) looks simple, involves no repository configuration option, and
      removes all the deltification overhead from the commit process
      itself.  O(1) commits, finally.

You have my vote.  Subversion chants the "disk is cheap" mantra all
over the place.  If we really believe that, it won't hurt to stop
deltifying in-process and start doing it in the hooks, even adding the
exact command-line for running the 'svnadmin tunefs' (or whatever)
command necessary in the post-commit.tmpl template.  

Some kind of post-commit cleanup is necessary anyway, because I have a
strong suspicion that what we save in in-database storage by
deltifying, we lose temporarily in out-of-database storage thanks to
the logfiles generated during the deltification process.  Heh... I
just ran the fs-test binary as-is.  44.6 Megs of disk consumed by my
tests/libsvn_fs directory (and the test repos in it) now.  So, tweaky
tweaky... turn off deltification in tree.c... compile... re-run
fs-test -- 44.1 Megs.  Nice.

fs-test doesn't *nearly* represent normal usage, though some of the
tests do cover non-small binary files, and of course there are lots of
tests of really small (Greek) file mods.  But, with no post-commit
processing at all, we see that, at least for that dataset, it is
cheaper in terms of disk usage *and* speed (no proof of this, but I
think that's a trustworthy proposition) to *not* deltify *anything*.
Of course, if we had a script to remove logfiles, our deltification
would surely have paid off (at least, space-wise).  

I say all this to promote the idea that it isn't too much to ask of a
repos administrator to run some out-of-process deltification routine
-- even per-commit -- because if they are truly concerned about disk
space, they'll already have some out-of-process log-file cleanup
process.  And if you have a cronjob/post-commit hook to clean up
logfiles, what's an extra line in that script to deltify?

