Posted to dev@subversion.apache.org by Evgeny Kotkov <ev...@visualsvn.com> on 2018/10/22 20:14:43 UTC

[PATCH] Proof of concept of the better-pristines (LZ4 + storing small pristines as BLOBs) (Was: Re: svn commit: r1843076)

Branko Čibej <br...@apache.org> writes:

> Still missing is a mechanism for the libsvn_wc (and possibly
> libsvn_client) to determine the capabilities of the working copy at
> runtime (this will be needed for deciding whether to use compressed
> pristines).

FWIW, I tried the idea of using LZ4 to compress the pristines and storing small
pristines as blobs in the `PRISTINE` table.  I was particularly interested in
how such a change would affect the performance and what kind of obstacles
would have to be dealt with.

In the attachment you will find a more or less functional implementation of
this idea that might be useful to some extent.  The patch is a proof of
concept: it doesn't include the WC compatibility bits and most certainly
doesn't have everything necessary in place.  But in the meantime, I think
that it might give a good approximation of what can be expected from the
approach.

The patch applies to the `better-pristines` branch.

A couple of observations:

 - As expected, the combined size of the pristines is halved when the data
   itself is compressible, thus making the working copy 25% smaller.

 - A variety of the callers currently access the pristine contents by reading
   the corresponding files.  That doesn't work in case of compressed pristines
   or pristines stored as BLOBs.

   I think that ideally we would want to use streams as much as possible, and
   only spill the uncompressed pristine contents to temporary files when we
   need to pass them to external tools, etc.; and that temporary files need
   to be backed by a work queue to avoid leaving them in place in case of an
   application crash.

   The patch does that kind of plumbing to some extent, but that part of the
   work is not complete.  The starting point is around wc_db_pristine.c:
   svn_wc__db_pristine_get_path().

 - Using BLOBs to store the pristine contents didn't have a measurable impact
   on the speed of the WC operations such as checkout in my experiments on
   Windows.  These experiments were not comprehensive, and also I didn't run
   the tests on *nix.

 - There's also the deprecated svn_wc_get_pristine_copy_path() public API that
   would require plumbing to maintain compatibility; the patch performs it by
   spilling the pristine contents result into a temporary file whose lifetime
   is attached to the `result_pool`.
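
As a rough illustration of the first two observations, here is a minimal
sketch (not the attached patch) of keeping small pristines as compressed BLOBs
and handing them back as streams.  Assumptions: zlib stands in for LZ4, the
4096-byte cutoff is arbitrary, the table is a simplified stand-in where only
`md5_checksum` and `data` come from the patch, and the function names are
hypothetical.

```python
import hashlib
import io
import sqlite3
import zlib

# Hypothetical cutoff: pristines at or below this size are stored in the DB.
SMALL_PRISTINE_LIMIT = 4096


def open_store():
    """Create an in-memory database with a simplified PRISTINE table."""
    db = sqlite3.connect(":memory:")
    # Only `md5_checksum` and `data` appear in the patch; the other columns
    # are illustrative.
    db.execute(
        "CREATE TABLE PRISTINE ("
        " checksum TEXT NOT NULL PRIMARY KEY,"
        " size INTEGER NOT NULL,"
        " md5_checksum TEXT NOT NULL,"
        " data BLOB)"
    )
    return db


def install_pristine(db, contents):
    """Record a pristine; small ones are stored compressed in `data`."""
    sha1 = hashlib.sha1(contents).hexdigest()
    md5 = hashlib.md5(contents).hexdigest()
    blob = None  # NULL means "contents live in a file" (not shown here)
    if len(contents) <= SMALL_PRISTINE_LIMIT:
        blob = zlib.compress(contents)  # zlib is a stand-in for LZ4
    db.execute(
        "INSERT OR IGNORE INTO PRISTINE VALUES (?, ?, ?, ?)",
        (sha1, len(contents), md5, blob),
    )
    return sha1


def open_pristine_stream(db, sha1):
    """Return a readable stream of the uncompressed contents, or None."""
    row = db.execute(
        "SELECT data FROM PRISTINE WHERE checksum = ?", (sha1,)
    ).fetchone()
    if row is None or row[0] is None:
        return None
    return io.BytesIO(zlib.decompress(row[0]))
```

A caller that only ever sees the stream does not need to know whether the
contents came from a BLOB or from a file, which is the point of preferring
streams over paths.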

 (I probably won't be able to continue the work on this patch in the near
 future; posting this in case it might be useful.)


Thanks,
Evgeny Kotkov

Re: [PATCH] Proof of concept of the better-pristines (LZ4 + storing small pristines as BLOBs) (Was: Re: svn commit: r1843076)

Posted by Branko Čibej <br...@apache.org>.
On 29.10.2018 14:27, Bert Huijben wrote:
> On Windows' NTFS implementation very small files (probably something
> like < 256 bytes, but this is not documented/strictly stable) are
> stored in the directory table and so don't use 'a whole cluster'.

Right — and this is exactly the kind of optimisation we're aiming for by
putting small(ish) pristine blobs directly into the working copy database.

-- Brane


Re: [PATCH] Proof of concept of the better-pristines (LZ4 + storing small pristines as BLOBs) (Was: Re: svn commit: r1843076)

Posted by Bert Huijben <be...@qqmail.nl>.
On Windows' NTFS implementation very small files (probably something like <
256 bytes, but this is not documented/strictly stable) are stored in the
directory table and so don't use 'a whole cluster'.

Nice work on all the research!

    Bert

On Tue, Oct 23, 2018 at 6:12 PM, Branko Čibej <br...@apache.org> wrote:

> On 22.10.2018 22:14, Evgeny Kotkov wrote:
> > Branko Čibej <br...@apache.org> writes:
> >
> >> Still missing is a mechanism for the libsvn_wc (and possibly
> >> libsvn_client) to determine the capabilities of the working copy at
> >> runtime (this will be needed for deciding whether to use compressed
> >> pristines).
> > FWIW, I tried the idea of using LZ4 to compress the pristines and storing
> > small pristines as blobs in the `PRISTINE` table.  I was particularly
> > interested in how such a change would affect the performance and what kind
> > of obstacles would have to be dealt with.
>
> Nice! I did some simpler tests by compressing exported trees, but this
> is definitely better.
>
> > In the attachment you will find a more or less functional implementation
> > of this idea that might be useful to some extent.  The patch is a proof of
> > concept: it doesn't include the WC compatibility bits and most certainly
> > doesn't have everything necessary in place.  But in the meantime, I think
> > that it might give a good approximation of what can be expected from the
> > approach.
> >
> > The patch applies to the `better-pristines` branch.
> >
> > A couple of observations:
> >
> >  - As expected, the combined size of the pristines is halved when the
> >    data itself is compressible, thus making the working copy 25% smaller.
>
> Yes, that was my observation as well. In fact, though, storing small
> BLOBs in the database itself should have even better effects, since the
> space on disk actually used by a file is rounded up to the nearest
> cluster size, but SQLite's blocks are typically much smaller than that.
>
>
> >  - A variety of the callers currently access the pristine contents by
> >    reading the corresponding files.  That doesn't work in case of
> >    compressed pristines or pristines stored as BLOBs.
> >
> >    I think that ideally we would want to use streams as much as possible,
> >    and only spill the uncompressed pristine contents to temporary files
> >    when we need to pass them to external tools, etc.; and that temporary
> >    files need to be backed by a work queue to avoid leaving them in place
> >    in case of an application crash.
>
> Yes and yes. Keeping those temporary spilled files on disk could turn
> out to be a problem; finding a reasonable time to delete them without
> having to run cleanup will be rather important, I think.
>
>
> >    The patch does that kind of plumbing to some extent, but that part of
> >    the work is not complete.  The starting point is around
> >    wc_db_pristine.c: svn_wc__db_pristine_get_path().
> >
> >  - Using BLOBs to store the pristine contents didn't have a measurable
> >    impact on the speed of the WC operations such as checkout in my
> >    experiments on Windows.  These experiments were not comprehensive, and
> >    also I didn't run the tests on *nix.
>
> I wouldn't expect much change in performance but would expect better use
> of the disk, as explained above.
>
> >  - There's also the deprecated svn_wc_get_pristine_copy_path() public API
> >    that would require plumbing to maintain compatibility; the patch
> >    performs it by spilling the pristine contents result into a temporary
> >    file whose lifetime is attached to the `result_pool`.
>
> Ack; that's one reasonable definition of "lifetime." But I suspect that
> any users of that function expect the pristine file to survive at least
> to the next WC cleanup.
>
> >  (I probably won't be able to continue the work on this patch in the near
> >  future; posting this in case it might be useful.)
>
> Thanks, it definitely is useful!
>
> -- Brane
>
>

Re: [PATCH] Proof of concept of the better-pristines (LZ4 + storing small pristines as BLOBs) (Was: Re: svn commit: r1843076)

Posted by Branko Čibej <br...@apache.org>.
On 22.10.2018 22:14, Evgeny Kotkov wrote:
> Branko Čibej <br...@apache.org> writes:
>
>> Still missing is a mechanism for the libsvn_wc (and possibly
>> libsvn_client) to determine the capabilities of the working copy at
>> runtime (this will be needed for deciding whether to use compressed
>> pristines).
> FWIW, I tried the idea of using LZ4 to compress the pristines and storing small
> pristines as blobs in the `PRISTINE` table.  I was particularly interested in
> how such a change would affect the performance and what kind of obstacles
> would have to be dealt with.

Nice! I did some simpler tests by compressing exported trees, but this
is definitely better.

> In the attachment you will find a more or less functional implementation of
> this idea that might be useful to some extent.  The patch is a proof of
> concept: it doesn't include the WC compatibility bits and most certainly
> doesn't have everything necessary in place.  But in the meantime, I think
> that it might give a good approximation of what can be expected from the
> approach.
>
> The patch applies to the `better-pristines` branch.
>
> A couple of observations:
>
>  - As expected, the combined size of the pristines is halved when the data
>    itself is compressible, thus making the working copy 25% smaller.

Yes, that was my observation as well. In fact, though, storing small
BLOBs in the database itself should have even better effects, since the
space on disk actually used by a file is rounded up to the nearest
cluster size, but SQLite's blocks are typically much smaller than that.
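
To make the rounding argument concrete, here is a deliberately simplified
model.  The 4096-byte cluster and 1024-byte page are assumed values, not
measurements, and the model ignores per-row and b-tree overhead as well as
filesystem-specific handling of tiny files.

```python
def allocated_as_files(sizes, cluster_size=4096):
    """Bytes allocated when each pristine is its own file.

    Every file is rounded up to a whole number of clusters.
    """
    return sum(-(-size // cluster_size) * cluster_size for size in sizes)


def allocated_as_blobs(sizes, page_size=1024):
    """Bytes allocated when the same contents are packed into DB pages.

    Simplification: contents are packed back to back, so only the total
    is rounded up to a whole number of pages.
    """
    total = sum(sizes)
    return -(-total // page_size) * page_size


sizes = [100, 300, 700, 1500]      # four small pristines, 2600 bytes in total
print(allocated_as_files(sizes))   # 16384: one 4096-byte cluster per file
print(allocated_as_blobs(sizes))   # 3072: three 1024-byte pages
```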


>  - A variety of the callers currently access the pristine contents by reading
>    the corresponding files.  That doesn't work in case of compressed pristines
>    or pristines stored as BLOBs.
>
>    I think that ideally we would want to use streams as much as possible, and
>    only spill the uncompressed pristine contents to temporary files when we
>    need to pass them to external tools, etc.; and that temporary files need
>    to be backed by a work queue to avoid leaving them in place in case of an
>    application crash.

Yes and yes. Keeping those temporary spilled files on disk could turn
out to be a problem; finding a reasonable time to delete them without
having to run cleanup will be rather important, I think.


>    The patch does that kind of plumbing to some extent, but that part of the
>    work is not complete.  The starting point is around wc_db_pristine.c:
>    svn_wc__db_pristine_get_path().
>
>  - Using BLOBs to store the pristine contents didn't have a measurable impact
>    on the speed of the WC operations such as checkout in my experiments on
>    Windows.  These experiments were not comprehensive, and also I didn't run
>    the tests on *nix.

I wouldn't expect much change in performance but would expect better use
of the disk, as explained above.

>  - There's also the deprecated svn_wc_get_pristine_copy_path() public API that
>    would require plumbing to maintain compatibility; the patch performs it by
>    spilling the pristine contents result into a temporary file whose lifetime
>    is attached to the `result_pool`.

Ack; that's one reasonable definition of "lifetime." But I suspect that
any users of that function expect the pristine file to survive at least
to the next WC cleanup.
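
For readers unfamiliar with pool-scoped lifetimes, the behaviour under
discussion can be sketched roughly as follows.  This is an analogy only:
Python's `contextlib.ExitStack` stands in for an APR pool, and the function
names are hypothetical, not part of any Subversion API.

```python
import contextlib
import io
import os
import tempfile


def _remove_quietly(path):
    """Delete `path`, ignoring the case where it is already gone."""
    with contextlib.suppress(FileNotFoundError):
        os.remove(path)


def spill_to_tempfile(stream, pool):
    """Copy `stream` into a temporary file and return its path.

    The file is removed when `pool` (an ExitStack standing in for a
    result pool) is closed.
    """
    fd, path = tempfile.mkstemp(prefix="pristine-")
    # Register cleanup first so a failed copy does not leak the file.
    pool.callback(_remove_quietly, path)
    with os.fdopen(fd, "wb") as out:
        for chunk in iter(lambda: stream.read(65536), b""):
            out.write(chunk)
    return path


with contextlib.ExitStack() as pool:
    path = spill_to_tempfile(io.BytesIO(b"pristine contents"), pool)
    # The path is usable here, e.g. to hand to an external tool.
# Once the "pool" is closed, the file no longer exists.
```

A caller that holds on to the returned path after the pool is gone sees the
file disappear, which is the compatibility concern raised above.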

>  (I probably won't be able to continue the work on this patch in the near
>  future; posting this in case it might be useful.)

Thanks, it definitely is useful!

-- Brane


Re: [PATCH] Proof of concept of the better-pristines (LZ4 + storing small pristines as BLOBs) (Was: Re: svn commit: r1843076)

Posted by Branko Čibej <br...@apache.org>.
On 22.10.2018 22:14, Evgeny Kotkov wrote:
> @@ -104,9 +103,13 @@
>  CREATE TABLE PRISTINE (
>   /* Alternative MD5 checksum used for communicating with older
>      repositories. Not strictly guaranteed to be unique among table rows. */
>
> -  md5_checksum  TEXT NOT NULL
> +  md5_checksum  TEXT NOT NULL,
> +
> +  data  BLOB
>   );
> +/* TODO: While here, perhaps add `WITHOUT ROWID` for the new schema table? */
>
> +

Note that this is no longer how schema changes will work. Instead, we
create the oldest supported schema (currently 1.8) and then roll forward
to the desired format.

We could probably add WITHOUT ROWID anyway, but that will only affect
new working copies, not upgrades of existing ones.
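
A minimal sketch of the "create the oldest schema, then roll forward" idea,
under stated assumptions: the format numbers, the upgrade table and the
two-column base schema are hypothetical placeholders, not the real libsvn_wc
definitions.

```python
import sqlite3

# Hypothetical format numbers; the real ones are defined by libsvn_wc.
BASE_FORMAT = 1
BASE_SCHEMA = """
CREATE TABLE PRISTINE (
  checksum      TEXT NOT NULL PRIMARY KEY,
  md5_checksum  TEXT NOT NULL
);
"""

# One upgrade step per format bump, applied in order.
UPGRADES = {
    2: "ALTER TABLE PRISTINE ADD COLUMN data BLOB;",
}


def create_db(target_format):
    """Create the oldest schema, then roll forward to `target_format`."""
    db = sqlite3.connect(":memory:")
    db.executescript(BASE_SCHEMA)
    for fmt in range(BASE_FORMAT + 1, target_format + 1):
        db.executescript(UPGRADES[fmt])
    db.execute(f"PRAGMA user_version = {target_format}")
    return db


def columns(db, table):
    """Return the column names of `table` in declaration order."""
    return [row[1] for row in db.execute(f"PRAGMA table_info({table})")]


print(columns(create_db(2), "PRISTINE"))
```

SQLite's ALTER TABLE can add a column like this, but it cannot turn an
existing table into a WITHOUT ROWID one; that would mean creating a new table
and copying the rows.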

-- Brane