You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@subversion.apache.org by Hyrum K Wright <hy...@hyrumwright.org> on 2011/02/01 05:29:16 UTC

Deltifying directories on the server

Philip and I had an interesting conversation with some users this
evening, and I'm just archiving my brain dump here.

These users have a large repository with a large number of branches in
the /branches directory (~35k).  We described the well-known
phenomenon in which directories aren't deltified on commit, and thus
cause the repository to have very large revisions, even when the
actual content changes are fairly small.  This is due to bubble up and
having to re-write the entire directory list of the /branches
directory.

Philip recalled a time several years ago when he enabled directory
deltification, but the performance was awful, and we've never released
it.  In our discussion, we mentioned that directory deltification may
be better performing now, especially in light of the imminent merge of
the diff-bytes-optimizations branch.  In the case of a bubble-up
directory modification, the prefix and suffix matching would simplify
the problem space, leaving a very small diff.

The only trouble with the above theory is if directory entry lists are
stored in a hash, and are serialized in an unordered manner, thus
negating any benefits prefix-scanning would provide (and potentially
causing the horrific delta performance in the first place).

Anyway, that was the kernel of our discussion.  I haven't dug around
in the code to determine how much of it is true or not, but if anybody
wants something to do, this might be interesting.

-Hyrum

Re: Deltifying directories on the server

Posted by Hyrum K Wright <hy...@hyrumwright.org>.

On Tue, Feb 1, 2011 at 10:54 AM, Greg Hudson <gh...@mit.edu> wrote:
> On Tue, 2011-02-01 at 10:29 -0500, C. Michael Pilato wrote:
>> I can only really speak for the BDB side of things, but... "what he said".
>
> I'll elaborate a little bit.  API issues aside, we're used to putting
> artifacts from different versions in different places.  More so in FSFS,
> where it was baked into the initial architecture, but also in BDB for
> the most part.
>
> The most efficient storage for large directories which frequently change
> by small deltas would be some kind of multi-rooted B-tree.  To do that
> efficiently (that is, without scattering each tiny change into a
> separate disk block, requiring lots and lots of opens/seeks/reads),
> you'd want to put artifacts from different versions of a directory all
> in the same place.  You might be able to arrange it so that modifying a
> directory is an append-only operation, avoiding the need for a lot of
> copying, but you'd still want a place to append to for each directory,
> which isn't how FSFS or BDB works.
>
> So, I'm not sure we can ever have efficient handling of this use case
> without designing a completely new back end--which wouldn't be a
> terrible idea if someone's up to it.

There's been a bit of talk here and there about a new back end, but
nothing concrete just yet.

Thanks all for the insight.

-Hyrum

Re: Deltifying directories on the server

Posted by Greg Hudson <gh...@MIT.EDU>.

On Tue, 2011-02-01 at 10:29 -0500, C. Michael Pilato wrote:
> I can only really speak for the BDB side of things, but... "what he said".

I'll elaborate a little bit.  API issues aside, we're used to putting
artifacts from different versions in different places.  More so in FSFS,
where it was baked into the initial architecture, but also in BDB for
the most part.

The most efficient storage for large directories which frequently change
by small deltas would be some kind of multi-rooted B-tree.  To do that
efficiently (that is, without scattering each tiny change into a
separate disk block, requiring lots and lots of opens/seeks/reads),
you'd want to put artifacts from different versions of a directory all
in the same place.  You might be able to arrange it so that modifying a
directory is an append-only operation, avoiding the need for a lot of
copying, but you'd still want a place to append to for each directory,
which isn't how FSFS or BDB works.

So, I'm not sure we can ever have efficient handling of this use case
without designing a completely new back end--which wouldn't be a
terrible idea if someone's up to it.

Re: Deltifying directories on the server

Posted by "C. Michael Pilato" <cm...@collab.net>.

On 02/01/2011 04:43 AM, Branko Čibej wrote:
> You do know that "diff" and "delta" are two different beasts, and that
> the diff optimizations have no effect on deltas? :)
> 
> The problem with directory deltification lies in the length of the delta
> chain and the frequency of directory lookup compared to file access. The
> sad fact is that our directory storage (/and/ our API) are woefully
> unsuited to their tasks. The way they're stored now (in both BDB and
> FSFS back-ends) requires the whole directory to be read into memory and
> hashed in order to find a single entry, and you have to do this for each
> level of directories when resolving a path. It doesn't help that most of
> the APIs are strictly path based, e.g., editor drives will do the lookup
> any number of times.
> 
> The whole concept of directory storage needs to be changed. The easiest
> way would be to store directories as B-trees, however, that doesn't play
> nice with versioning. On the other hand, directory structure is well
> known and there's no reason to use a generic delta algorithm to store
> them. I could probably come up with a better storage schema for
> directories in a couple weeks, but I don't have the time to implement
> such a thing.
> 
> -- Brane

I can only really speak for the BDB side of things, but... "what he said".

-- 
C. Michael Pilato <cm...@collab.net>
CollabNet   <>   www.collab.net   <>   Distributed Development On Demand

Re: Deltifying directories on the server

Posted by Branko Čibej <br...@e-reka.si>.

You do know that "diff" and "delta" are two different beasts, and that
the diff optimizations have no effect on deltas? :)

The problem with directory deltification lies in the length of the delta
chain and the frequency of directory lookup compared to file access. The
sad fact is that our directory storage (/and/ our API) are woefully
unsuited to their tasks. The way they're stored now (in both BDB and
FSFS back-ends) requires the whole directory to be read into memory and
hashed in order to find a single entry, and you have to do this for each
level of directories when resolving a path. It doesn't help that most of
the APIs are strictly path based, e.g., editor drives will do the lookup
any number of times.

The whole concept of directory storage needs to be changed. The easiest
way would be to store directories as B-trees, however, that doesn't play
nice with versioning. On the other hand, directory structure is well
known and there's no reason to use a generic delta algorithm to store
them. I could probably come up with a better storage schema for
directories in a couple weeks, but I don't have the time to implement
such a thing.

-- Brane

On 01.02.2011 05:29, Hyrum K Wright wrote:
> Philip and I had an interesting conversation with some users this
> evening, and I'm just archiving my brain dump here.
>
> These users have a large repository with a large number of branches in
> the /branches directory (~35k).  We described the well-known
> phenomenon in which directories aren't deltified on commit, and thus
> cause the repository to have very large revisions, even when the
> actual content changes are fairly small.  This is due to bubble up and
> having to re-write the entire directory list of the /branches
> directory.
>
> Philip recalled a time several years ago when he enabled directory
> deltification, but the performance was awful, and we've never released
> it.  In our discussion, we mentioned that directory deltification may
> be better performing now, especially in light of the imminent merge of
> the diff-bytes-optimizations branch.  In the case of a bubble-up
> directory modification, the prefix and suffix matching would simplify
> the problem space, leaving a very small diff.
>
> The only trouble with the above theory is if directory entry lists are
> stored in a hash, and are serialized in an unordered manner, thus
> negating any benefits prefix-scanning would provide (and potentially
> causing the horrific delta performance in the first place).
>
> Anyway, that was the kernel of our discussion.  I haven't dug around
> in the code to determine how much of it is true or not, but if anybody
> wants something to do, this might be interesting.
>
> -Hyrum