Posted to dev@subversion.apache.org by Stefan Fuhrmann <st...@wandisco.com> on 2013/02/16 10:52:21 UTC

FSFS format7 status and first results

Hey all,

Just to give you an update on what is going on that branch,
here a few facts and numbers.  Bottom line is that there is
still a lot to do but the basic assumptions proved correct and
significant benefits can already be demonstrated.

* about 20% of the coding is done so far
* some core features implemented:
  logical addressing, reorg upon pack, block read
* format 7 repos are ~3x faster due to reduced I/O
* format 6 repos get faster by ~10%

* format 7 performance hit by cache inefficiency;
  will be addressed by 'container' objects plus a
  change in the membuffer implementation
* on-disk format is still in flux and will remain
  so for a while

Tests were run against a local copy of the TSVN repo with
revprop compression, directory deltification and property
deltification enabled to bring format 6 structurally as
close to format 7 defaults as possible.

All values are given in user-time triples for trunk@1446787
on format 6, fsfs-format7@1446862 on format 6 and format 7.
"hot" runs mean "from OS cache"; svnserve was restarted
before every test run.

$ time svnadmin verify -M 4000 -q $repo

medium      trunk / branch6 / branch7
USB cold   270.5s / 246.3s / 104.4s   = 1.00 : 1.10 : 2.59
USB hot    248.0s / 191.0s /  60.8s   = 1.00 : 1.30 : 4.08
SSD cold    66.1s /  62.0s /  59.4s   = 1.00 : 1.07 : 1.11
SSD hot     63.2s /  60.0s /  57.1s   = 1.00 : 1.05 : 1.11

$ time svnbench null-export svn://localhost/tsvn/trunk -q

USB cold   44.29s / 39.63s / 16.24s   = 1.00 : 1.12 : 2.73
USB hot    10.68s / 10.25s /  3.78s   = 1.00 : 1.04 : 2.83
SSD cold    5.72s /  5.00s /  3.75s   = 1.00 : 1.14 : 1.53
SSD hot     2.37s /  2.38s /  3.21s   = 1.00 : 1.00 : 0.74

$ time svnbench null-log svn://localhost/tsvn/trunk -v -q

USB cold   54.36s / 50.17s /  8.73s   = 1.00 : 1.06 : 6.11
USB hot    43.64s / 36.46s /  3.52s   = 1.00 : 1.20 :12.40
SSD cold    9.32s / 10.60s /  3.22s   = 1.00 : 0.88 : 2.89
SSD hot     2.36s /  2.28s /  2.88s   = 1.00 : 1.04 : 0.82

$ time svnbench null-log svn://localhost/tsvn/trunk -v -g -q

USB cold   98.02s / 87.01s / 23.74s   = 1.00 : 1.13 : 4.13
USB hot    69.88s / 57.14s /  7.88s   = 1.00 : 1.22 : 8.87
SSD cold    8.35s / 10.50s /  8.16s   = 1.00 : 0.80 : 1.02
SSD hot     5.94s /  5.72s /  6.39s   = 1.00 : 1.04 : 0.93

Tests have been conducted with maximum optimization:

  ./configure --disable-shared --disable-debug --enable-optimize \
  --without-berkeley-db --without-serf CUSERFLAGS='-march=native'

Svnserve configuration:

  svnserve -dTr $repos -c 0 -M 1000 --client-speed 100 --foreground \
  --cache-txdeltas yes --cache-fulltexts yes --cache-revprops yes

Machine:

  Core2 Duo 2.4Ghz, 8GB RAM, Ubuntu 12.04, 64 bit, SMP
  128GB SSD built-in, ext4
  320GB USB2 HDD external, NTFS

-- Stefan^2.

-- 
Certified & Supported Apache Subversion Downloads:
http://www.wandisco.com/subversion/download

Re: FSFS format7 and compressed XML bundles

Posted by Mark Phippard <ma...@gmail.com>.
On Thu, Feb 28, 2013 at 1:45 PM, Ben Reser <be...@reser.org> wrote:

> On Thu, Feb 28, 2013 at 8:28 AM, Mark Phippard <ma...@gmail.com> wrote:
> > FWIW, the Branch Readme does imply he intends to work on some things that
> > might have an impact here.
>

I pasted the contents of the readme merely to point out that he indicates
that he is looking at changes beyond the on-disk storage of the fs, and
that those changes "might have an impact here".  I was not saying anything
else.  It might make things better or worse, it might never ever get worked
on.  Just pointing it out.

I think there has been a lot of evidence presented on our lists over the
years that if we had a way to bypass the entire deltification process on
large binaries that do not deltify well, it would greatly improve
performance in those cases.  I think it would be good if that part of the
proposal was implemented.

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: FSFS format7 and compressed XML bundles

Posted by Ben Reser <be...@reser.org>.
On Thu, Feb 28, 2013 at 8:28 AM, Mark Phippard <ma...@gmail.com> wrote:
> FWIW, the Branch Readme does imply he intends to work on some things that
> might have an impact here.  Specifically:
>
> TxDelta v2
> ----------
>
> Version 1 of txdelta turns out to be limited in its effectiveness for
> larger files when data gets inserted or removed.  For typical office
> documents (zip files), deltification often becomes ineffective.
>
> Version 2 shall introduce the following changes:
>
> - increase the delta window from 100kB to 1MB
> - use a sliding window instead of a fixed-sized one
> - use a slightly more efficient instruction encoding

I think that the office documents example in the above is a really
poor example for what he's looking at doing.  An adaptively compressed
file is unlikely to be stored more efficiently just because of changes
to the delta windows.  The instruction encoding will help ever so
slightly but not enough to be of any consequence to the issue at hand
here.

> Large file storage
> ------------------
>
> Even most source code repositories contain large, hard to compress,
> hard to deltify binaries.  Reconstructing their content becomes very I/O
> intense and it "dilutes" the data in our pack files.  The latter makes
> e.g. caching, prefetching and packing less efficient.
>
> Once a representation exceeds a certain configured threshold (16M default),
> the fulltext of that item will be stored in a separate file.  This will
> be marked in the representation_t by an extra flag and future reps will
> not be deltified against it.  From that location, the data can be forwarded
> directly via SendFile and the fulltext caches will not be used for it.
>
> Note that by making the decision contingent upon the size of the deltified
> and packed representation,  all large data that benefits from these will
> still be stored within the rev and pack files.

This would help more than the previously mentioned changes because
there won't be additional overhead from our deltification, which is
made inefficient by the fact that the file is already compressed.  However,
you'll still be stuck with storing full texts of each revision.  Which
again is not what I think in the end is a desired outcome.  But only
if the file is over 16MB.  I'm not sure how often that applies to
these files.

Re: FSFS format7 and compressed XML bundles

Posted by Mark Phippard <ma...@gmail.com>.
On Thu, Feb 28, 2013 at 11:25 AM, Branko Čibej <br...@wandisco.com> wrote:

> On 28.02.2013 08:04, Magnus Thor Torfason wrote:
> > Hey all,
> >
> > I've been following the discussion about FSFS format7, and had a
> > question: Is there any chance that the format would improve storage
> > efficiency for documents that are stored as compressed (zipped)
> > bundles of XML files and other resource files (Read MS Office
> > Documents, but OpenOffice is similar).
> >
> > I'm finding that making very small changes in big documents (with
> > embedded images) results in rapid growth of the repository, since the
> > binary diff algorithm seems to not be able to figure out efficient
> > deltas for this type of documents, even though analysis of the
> > contents shows that they are almost unchanged.
> >
> > This may be outside the scope of format7, but I thought I'd ask the
> > question nevertheless.
>
> It is outside the scope, format7 is about physical storage layout and
> does not affect the delta/compression layer -- which is the one
> responsible for the effect you're seeing.
>
> We're aware of the issues regarding compressed files, and I expect will
> eventually come up with a solution. The problem just hasn't seemed all
> that important compared to other things we're trying to solve.
>
> That said, I'm sure we'd welcome any suggestions about how to handle
> such files more efficiently. I can think of a few (e.g., decompress the
> files before deltifying them), but it's always good to hear other points
> of view.
>

FWIW, the Branch Readme does imply he intends to work on some things that
might have an impact here.  Specifically:

TxDelta v2
----------

Version 1 of txdelta turns out to be limited in its effectiveness for
larger files when data gets inserted or removed.  For typical office
documents (zip files), deltification often becomes ineffective.

Version 2 shall introduce the following changes:

- increase the delta window from 100kB to 1MB
- use a sliding window instead of a fixed-sized one
- use a slightly more efficient instruction encoding

When introducing it,  we will make it an option at the txdelta interfaces
(e.g. a format number).  The version will be indicated in the 'SVN\x1' /
'SVN\x2' stream header.  While at it, (try to) fix the layering violations
where those prefixes are being read or written.


Large file storage
------------------

Even most source code repositories contain large, hard to compress,
hard to deltify binaries.  Reconstructing their content becomes very I/O
intense and it "dilutes" the data in our pack files.  The latter makes
e.g. caching, prefetching and packing less efficient.

Once a representation exceeds a certain configured threshold (16M default),
the fulltext of that item will be stored in a separate file.  This will
be marked in the representation_t by an extra flag and future reps will
not be deltified against it.  From that location, the data can be forwarded
directly via SendFile and the fulltext caches will not be used for it.

Note that by making the decision contingent upon the size of the deltified
and packed representation,  all large data that benefits from these will
still be stored within the rev and pack files.
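
To make the windowing point concrete, here is a toy sketch (nothing to do
with the actual svndiff code; the window size, random data and helper names
are invented for illustration) of why a delta window locked to fixed
positions loses its matches after an insertion, while a window that may
slide relative to the source keeps finding them:

  import random

  WINDOW = 100 * 1024

  def aligned_reuse(src, dst):
      """Bytes reusable when window i of dst may only match window i of src."""
      return sum(len(dst[i:i + WINDOW])
                 for i in range(0, len(dst), WINDOW)
                 if dst[i:i + WINDOW] == src[i:i + WINDOW])

  def sliding_reuse(src, dst):
      """Bytes reusable when a dst window may match anywhere in src."""
      return sum(len(dst[i:i + WINDOW])
                 for i in range(0, len(dst), WINDOW)
                 if dst[i:i + WINDOW] in src)

  src = random.Random(3).randbytes(1 << 20)        # ~1 MB "old" file
  dst = src[:1000] + b"#" * 4096 + src[1000:]      # 4 kB inserted near the front

  # Position-locked windows: nothing after the insertion lines up any more.
  # Sliding matches: almost the whole file is still found verbatim.
  print(aligned_reuse(src, dst), sliding_reuse(src, dst))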



-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: FSFS format7 and compressed XML bundles

Posted by Branko Čibej <br...@wandisco.com>.
On 28.02.2013 08:04, Magnus Thor Torfason wrote:
> Hey all,
>
> I've been following the discussion about FSFS format7, and had a
> question: Is there any chance that the format would improve storage
> efficiency for documents that are stored as compressed (zipped)
> bundles of XML files and other resource files (Read MS Office
> Documents, but OpenOffice is similar).
>
> I'm finding that making very small changes in big documents (with
> embedded images) results in rapid growth of the repository, since the
> binary diff algorithm seems to not be able to figure out efficient
> deltas for this type of documents, even though analysis of the
> contents shows that they are almost unchanged.
>
> This may be outside the scope of format7, but I thought I'd ask the
> question nevertheless.

It is outside the scope, format7 is about physical storage layout and
does not affect the delta/compression layer -- which is the one
responsible for the effect you're seeing.

We're aware of the issues regarding compressed files, and I expect will
eventually come up with a solution. The problem just hasn't seemed all
that important compared to other things we're trying to solve.

That said, I'm sure we'd welcome any suggestions about how to handle
such files more efficiently. I can think of a few (e.g., decompress the
files before deltifying them), but it's always good to hear other points
of view.

-- Brane

-- 
Branko Čibej
Director of Subversion | WANdisco | www.wandisco.com


Re: FSFS format7 and compressed XML bundles

Posted by Ben Reser <be...@reser.org>.
On Fri, Mar 1, 2013 at 6:30 AM, Vincent Lefevre <vi...@vinc17.net> wrote:
> On 2013-03-01 14:24:07 +0000, Philip Martin wrote:
>> $ gzip --help | grep rsync
>>   --rsyncable       Make rsync-friendly archive
>> $ dpkg -s gzip | grep Version
>> Version: 1.5-1.1
>
> OK, then it seems that its man page is out-of-date.

The patch probably just doesn't adjust the man page.  Also the patch
has been left out at least once and then added back in.

Re: FSFS format7 and compressed XML bundles

Posted by Vincent Lefevre <vi...@vinc17.net>.
On 2013-03-01 14:24:07 +0000, Philip Martin wrote:
> $ gzip --help | grep rsync
>   --rsyncable       Make rsync-friendly archive
> $ dpkg -s gzip | grep Version
> Version: 1.5-1.1

OK, then it seems that its man page is out-of-date.

-- 
Vincent Lefèvre <vi...@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Re: FSFS format7 and compressed XML bundles

Posted by Philip Martin <ph...@wandisco.com>.
Vincent Lefevre <vi...@vinc17.net> writes:

> On 2013-02-28 10:58:07 -0800, Ben Reser wrote:
>> Speaking with Julian here at ApacheCon he mentioned that gzip has a
>> rsyncable option.  Looking into this turns out that there is a patch
>> applied to Debian's gzip that provides this option.
>
> I can't see such an option in Debian's gzip (from unstable, and
> there is no experimental version).

$ gzip --help | grep rsync
  --rsyncable       Make rsync-friendly archive
$ dpkg -s gzip | grep Version
Version: 1.5-1.1

-- 
Certified & Supported Apache Subversion Downloads:
http://www.wandisco.com/subversion/download

Re: FSFS format7 and compressed XML bundles

Posted by Ben Reser <be...@reser.org>.
On Fri, Mar 1, 2013 at 5:49 PM, Julian Foad <ju...@btopenworld.com> wrote:
> No, that's not true.  I think the article Ben read was inaccurate.  The '--rsyncable' option doesn't reset the compression after a fixed number of bytes, but rather at every point where a rolling checksum of the last N bytes leading up to that point has a certain value.  It will resynchronize after an insertion or deletion.  The intervals between resets are irregular but deterministic.
>
> Here's an old but readable description and proof-of-concept: <http://svana.org/kleptog/rgzip.html>.
>
> Here's an announcement of implementation in pigz: <http://mail.zlib.net/pipermail/pigz-announce_zlib.net/2012-January/000003.html>.  It's described in more detail in a big comment near the beginning of 'pigz.c' in the source tarball available at <http://zlib.net/pigz/>.

This is the patch that Debian is actually applying to gzip:
http://ozlabs.org/~rusty/gzip.rsync.patch

This was the article I had read which sadly is misleading:
http://beeznest.wordpress.com/2005/02/03/rsyncable-gzip/

Julian is right, the window is rolling and so it works for inserts and
deletions.

Re: FSFS format7 and compressed XML bundles

Posted by Julian Foad <ju...@btopenworld.com>.
> Vincent Lefevre wrote:
>> Ben Reser wrote:
>>>  It resets the compression algorithm every 1000 bytes and thus makes
>>>  blocks that can be saved between revisions of the file.
>> 
>>  Wouldn't this work only when data are appended to the file?

>>  If data are inserted or deleted, this would change the block
>>  boundaries. Instead of fixed-length blocks, I'd rather see
>>  boundaries based on the file contents.
> 
> That's true, the compression blocks are fixed.

No, that's not true.  I think the article Ben read was inaccurate.  The '--rsyncable' option doesn't reset the compression after a fixed number of bytes, but rather at every point where a rolling checksum of the last N bytes leading up to that point has a certain value.  It will resynchronize after an insertion or deletion.  The intervals between resets are irregular but deterministic.

Here's an old but readable description and proof-of-concept: <http://svana.org/kleptog/rgzip.html>.

Here's an announcement of implementation in pigz: <http://mail.zlib.net/pipermail/pigz-announce_zlib.net/2012-January/000003.html>.  It's described in more detail in a big comment near the beginning of 'pigz.c' in the source tarball available at <http://zlib.net/pigz/>.
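
To illustrate the rolling-checksum idea, here is a minimal sketch; the window
size, mask and reset rule below are invented for illustration and are not what
the gzip patch or pigz actually use:

  import random

  WINDOW = 4096
  MASK = (1 << 12) - 1                     # ~one reset every 4 kB on average

  def reset_points(data):
      """Offsets where the rolling sum of the last WINDOW bytes hits the trigger."""
      points, rolling = [], 0
      for i, byte in enumerate(data):
          rolling += byte
          if i >= WINDOW:
              rolling -= data[i - WINDOW]  # drop the byte leaving the window
          if (rolling & MASK) == 0:
              points.append(i)
      return points

  base = random.Random(7).randbytes(1 << 20)
  edited = base[:10000] + b"a small insertion" + base[10000:]
  shift = len(edited) - len(base)

  p1 = set(reset_points(base))
  p2 = reset_points(edited)
  # Reset points after the edit reappear merely shifted by the insertion size,
  # so the compressed blocks between them resynchronize; resetting every fixed
  # 1000 bytes would instead misalign every block after the edit.
  kept = sum(1 for x in p2 if x in p1 or x - shift in p1)
  print(kept, "of", len(p2), "reset points preserved")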


Philip Martin wrote:
> Julian Foad <ju...@btopenworld.com> writes:
> 
>>  Yes, a client-side plug-in -- either to Subversion or to OpenOffice --
>>  seems to me the best practical solution.
> 
> A server-side solution is difficult.  Suppose the client has some
> uncompressed content U which it compresses to C and sends to the server.
> The server can uncompress C to get U but unless the compression scheme
> has a canonical compressed form, with no other forms allowed, the server
> cannot avoid storing C because there is no guarantee that C can be
> reconstructed from U.

Yes, a server-side solution would have lots of problems including that one.  Scalability is another -- keeping the server up to date with plug-ins for all (or most) of the compressed content types that the clients are using.

A client-side plug-in does not have those problems, at least not to the same extent.  It does have its own problems, though, including installation & configuration & portability issues.

- Julian

Re: FSFS format7 and compressed XML bundles

Posted by Ben Reser <be...@reser.org>.
On Fri, Mar 1, 2013 at 5:44 AM, Vincent Lefevre <vi...@vinc17.net> wrote:
> Wouldn't this work only when data are appended to the file?
> If data are inserted or deleted, this would change the block
> boundaries. Instead of fixed-length blocks, I'd rather see
> boundaries based on the file contents.

That's true, the compression blocks are fixed.

Re: FSFS format7 and compressed XML bundles

Posted by Vincent Lefevre <vi...@vinc17.net>.
On 2013-02-28 10:58:07 -0800, Ben Reser wrote:
> Speaking with Julian here at ApacheCon he mentioned that gzip has a
> rsyncable option.  Looking into this turns out that there is a patch
> applied to Debian's gzip that provides this option.

I can't see such an option in Debian's gzip (from unstable, and
there is no experimental version).

> It resets the compression algorithm every 1000 bytes and thus makes
> blocks that can be saved between revisions of the file.

Wouldn't this work only when data are appended to the file?
If data are inserted or deleted, this would change the block
boundaries. Instead of fixed-length blocks, I'd rather see
boundaries based on the file contents.

-- 
Vincent Lefèvre <vi...@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

RE: FSFS format7 and compressed XML bundles

Posted by Bert Huijben <be...@qqmail.nl>.

> -----Original Message-----
> From: Julian Foad [mailto:julianfoad@btopenworld.com]
> Sent: donderdag 28 februari 2013 20:54
> To: Ben Reser
> Cc: Magnus Thor Torfason; Subversion Development
> Subject: Re: FSFS format7 and compressed XML bundles
> 
> Ben Reser wrote:
> 
> > Speaking with Julian here at ApacheCon he mentioned that gzip has a
> > rsyncable option.  Looking into this turns out that there is a patch
> > applied to Debian's gzip that provides this option.  It resets the
> > compression algorithm every 1000 bytes and thus makes blocks that can
> 
> Use of such a zip format would be ideal -- Subversion's binary-delta would
> then calculate an excellent delta as long as each inserted chunk is smaller
> than the delta window size (currently 100 KB, Stefan's proposal 1 MB).

Word documents are not plain deflate streams. The office documents are
normal zip files with a different extension (and only supporting the default
compression type).

You can make this work for this file type, but it needs different code than
making gzip use the same trick.
(The deflate is restarted per file, and there are file headers between the
files and at the end)

	Bert


Re: FSFS format7 and compressed XML bundles

Posted by Vincent Lefevre <vi...@vinc17.net>.
On 2013-03-06 18:55:55 +0000, Julian Foad wrote:
> I don't know if anything like that would be feasible.  It may be
> possible in theory but too complex in practice.  The parameters we
> need to extract would include such things as the Huffman coding
> tables used and also parameters that influence deeper implementation
> details of the compression algorithm.  And of course for each
> compression algorithm we'd need an implementation that accepts all
> of these parameters -- an off-the-shelf compression library probably
> wouldn't support this.

The parameters could also be provided by the user, e.g. via a svn
property. For instance, if the user wants some file "file.xz" to
be handled uncompressed, he can add a svn:compress property whose
value is "xz -9" (if the -9 option was used). Then the client
would do a "unxz" on the file. If the user wants the bit pattern
to be preserved, the client would also do a "xz -9" on the unxz
output. If some command fails or the result is not identical to
the initial file (for preserved bit pattern), the file would be
handled compressed (or the client should issue an error message
if requested by the user). Otherwise the file could be handled
uncompressed. This is the basic idea. Then there are various
implementation choices, such as whether the commands should be
part of Subversion or external commands provided by the system.

With a property, it would not be possible to change the behavior
on past revisions, but tools could do that on a svn dump.
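
As a rough sketch of that idea (everything here is hypothetical -- the
svn:compress property does not exist, and mapping "xz -9" to lzma preset 9
is an assumption), the client-side check could look roughly like this:

  import lzma

  def try_handle_uncompressed(stored_bytes, svn_compress_value):
      """Hypothetical client-side step for a file carrying svn:compress='xz -9'.

      Returns (payload, handled_uncompressed): the data the client would hand
      to the delta layer, and whether the exact bit pattern can be restored.
      """
      if not svn_compress_value.startswith("xz"):
          return stored_bytes, False           # unknown scheme: leave it alone
      try:
          plain = lzma.decompress(stored_bytes)
      except lzma.LZMAError:
          return stored_bytes, False           # not really xz data
      # Recompress with the declared settings and only treat the file as
      # uncompressed if that reproduces the original bit pattern exactly.
      if lzma.compress(plain, preset=9) == stored_bytes:
          return plain, True
      return stored_bytes, False

On update the client would run the inverse step, recompressing the stored
plain text with the recorded settings before writing the working file.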

-- 
Vincent Lefèvre <vi...@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Re: FSFS format7 and compressed XML bundles

Posted by Julian Foad <ju...@btopenworld.com>.
Vincent Lefevre wrote:

> On 2013-03-05 16:52:30 +0000, Julian Foad wrote:
>>  Vincent Lefevre wrote:
> [about server-side vs client-side]
[...]
> Because the diff between two huge compressed files is generally huge
> (unless some rsync-friendly option has been applied, when available).
> So, if the client doesn't uncompress the data for the server, it will
> have to send a huge diff or a huge compressed file, even though the
> diff between the uncompressed data may be small. So, if
> deconstruction/reconstruction is possible (canonical form),
> it is much more efficient to do this on the client side.

Certainly that is true.

>>>>  That point _is_ specific to a server-side solution.  With a
>>>>  client-side solution, the user's word processor may not mind if a
>>>>  versioning operation such as a commit (through a decompressing
>>>>  plug-in) followed by checkout (through a re-compressing plug-in)
>>>>  changes the bit pattern of the compressed file, so long as the
>>>>  uncompressed content that it represents is unchanged.
>>> 
>>> I disagree.
>> 
>>  It's not clear what you disagree with.
> 
> With the second sentence ("... may not mind ..."), thus with the first
> sentence too.
[...]
>>> The word processor may not mind (in theory, because
>>> in practice, one may have bugs that depend on the bit pattern,
>>> and it would be bad to expose the user to such kind of bugs and
>>> non-deterministic behavior), but for the user this may be important.
>>> For instance, a different bit pattern will break a possible signature
>>> on the compressed file.
>> 
>>  I agree that it *may* be important for the user, but the users have
>>  control so they can use this client-side scheme in scenarios where
>>  it works for them and not use it in other scenarios.
> 
> But one should need a scheme that will also work in the case where
> users care about the bit pattern of the compressed file.

> Moreover even when the users know that the exact bit pattern of the
> compressed file is not important at some time, this may no longer
> be true in the future. For instance, some current word processor may
> ignore the dates in zip files, but future ones may take them into
> account. So, you need to wonder what data are important in a zip
> file, including undocumented ones used by some implementations (as
> the zip format allows extensions). Taking them into account when it
> appears that these data become meaningful is too late, because such
> data would have already been lost in past versions of the Subversion
> repository.

If you are thinking about a solution that we can apply automatically, then yes it would 
need to work in the case where users care about preserving the bit 
pattern.

I was thinking about an opt-in system, where the user is in control of specifying which files get processed in this way.  If the user is unsure whether the non-preservation of bit pattern is going to be important for their word processor files in the future, they can ask the  provider of their word processor whether this kind of modification is officially supported.  In many cases the answer will be "yes, we explicitly support that kind of archiving".

> On 2013-03-05 17:10:02 +0000, Julian Foad wrote:
[...]
>>  Let me take that back.  The point that I interpreted as being the
>>  most significant impact of what Philip said, namely that the
>>  Subversion protocols and system design require reproducible content,
>>  is only a problem when done server-side.  Other impacts of that same
>>  point, such as you mentioned, are applicable no matter whether
>>  server-side or client-side.
> 
> The Subversion protocols and system design *currently* require
> reproducible content, but if new features are added, e.g. due to the
> fact that the users don't mind about the exact compressed content of
> some file, then it could be decided to change the protocols and the
> requirements (the server could consider some canonical uncompressed
> form as a reference).

Conceivably.

> [...]
>>  So my main point is that the server-side expand/compress is a
>>  non-starter of an idea, because it violates basic Subversion
>>  requirements, whereas client-side is a viable option for some use
>>  cases.
> 
> I would reject the server-side expand/compress, not because of the
> current requirements (which could be changed to more or less match
> what happens on the client side), but because of performance reasons
> (see my first paragraph of this message).

Interesting thoughts.

The design of a bit-pattern-preserving solution is an interesting 
challenge.  In general a compression algorithm may have no canonical form, and not even be deterministically reproducible using only data that is available in the compressed file, and in those cases I don't see any theoretical solution.  However, perhaps some commonly used compressions are found in practice to be in a form which can be
reconstructed by the compression algorithm, if given a set of parameters 
that we are able to extract from the compressed data.

Perhaps it would be possible to design a scheme that scans the data stream for any such blocks (that are in one of the compression schemes it has been designed to recognize, such as 'deflate'), and extracts the parameters that will be necessary for exact recompression, and decompresses and recompresses at such times as to benefit from diffing the decompressed form.  This could work in theory if we can extract these parameters, and if 
diffing the plain text and preserving these parameters is cheaper than 
just diffing the compressed data.

I don't know if anything like that would be feasible.  It may be possible in theory but too complex in practice.  The parameters we need to extract would include such things as the Huffman coding tables used and also parameters that influence deeper implementation details of the 
compression algorithm.  And of course for each compression algorithm we'd need an implementation that accepts all of these parameters -- an off-the-shelf compression library probably wouldn't support this.

I was assuming that the only feasible solutions would be opt-in solutions where the user is willing to accept that the bit pattern is not preserved.

- Julian

Re: FSFS format7 and compressed XML bundles

Posted by Julian Foad <ju...@btopenworld.com>.
I (Julian Foad) wrote:
> Vincent Lefevre wrote:
>>  On 2013-03-05 13:30:28 +0000, Julian Foad wrote:
>>> Vincent Lefevre wrote:
>>>> On 2013-03-01 14:58:10 +0000, Philip Martin wrote:
>>>>> A server-side solution is difficult.  Suppose the client has some
>>>>> uncompressed content U which it compresses to C and sends to the server.
>>>>> The server can uncompress C to get U but unless the compression scheme
>>>>> has a canonical compressed form, with no other forms allowed, the server
>>>>> cannot avoid storing C because there is no guarantee that C can be
>>>>> reconstructed from U.
>>>> 
>>>> This is not specific to server side. Even on the client side, the
>>>> reconstruction may not be always possible, e.g. if the system is
>>>> upgraded or if NFS is used. And the compression level may need to
>>>> be detected or provided in some way.
>>> 
>>> Hi Vincent.  I'm not sure you understood Philip's point.
>> 
>> This should be more clear about what I meant below. What I'm saying is
>> that whether this is done entirely on the server side (a bad solution,
>> IMHO) or on the client side (see below why), the problems are similar.
> 
> The point Philip made is *not* a problem if done client-side;

Let me take that back.  The point that I interpreted as being the most significant impact of what Philip said, namely that the Subversion protocols and system design require reproducible content, is only a problem when done server-side.  Other impacts of that same point, such as you mentioned, are applicable no matter whether server-side or client-side.

Sorry for stubbornly applying my own interpretation there.

> some of the 
> *other* problems are similar no matter on which side we would do the 
> expansion/compression.
> 
>>> His point is (correct me if I'm wrong) that Subversion's design
>>> requires that during a checkout or update, the server must
>>> reconstruct a file containing exactly the same bit pattern that the
>>> client sent when committing the file.  Compression schemes in
>>> general don't guarantee that expanding and then compressing will
>>> produce the same compressed bit pattern, even if you take care to
>>> use the same "compression level".  Therefore, the server cannot
>>> simply expand the data before storing it and then re-compress it
>>> during checkout or update, because, although the resulting
>>> compressed file would be a valid representation of the user's data,
>>> it would not satisfy Subversion's own requirement that the bit
>>> pattern be identical to what was sent by the client during the
>>> commit.
>> 
>> You say that the server expands the data before storing it. This is
>> for a server-side only solution, I assume.
> 
> Yes, I'm talking about the server-side-only solution, which is one of the 
> hypothetical solutions that we are discussing and comparing.
> 
>> But even if there would
>> be no problems with the construction/reconstruction, it would be a
>> bad solution, IMHO. Indeed, for a commit, it is the client that is
>> supposed to expand the data before sending the diff to the server,
> 
> What do you mean "the client [...] is supposed to expand the data"?  I 
> don't understand why you think the client is "supposed" to do such 
> a thing.
> 
>> and for an update, it is the client that is supposed to recompress
>> the data before storing it to the WC. Actually, the server doesn't
>> need to know how the file was compressed, it just needs to record
>> information about the compression (but doesn't need to know what
>> this means exactly).
>> 
>>> That point _is_ specific to a server-side solution.  With a
>>> client-side solution, the user's word processor may not mind if a
>>> versioning operation such as a commit (through a decompressing
>>> plug-in) followed by checkout (through a re-compressing plug-in)
>>> changes the bit pattern of the compressed file, so long as the
>>> uncompressed content that it represents is unchanged.
>> 
>> I disagree.
> 
> It's not clear what you disagree with.
> 
>> The word processor may not mind (in theory, because
>> in practice, one may have bugs that depend on the bit pattern,
>> and it would be bad to expose the user to such kind of bugs and
>> non-deterministic behavior), but for the user this may be important.
>> For instance, a different bit pattern will break a possible signature
>> on the compressed file.
> 
> I agree that it *may* be important for the user, but the users have control so 
> they can use this client-side scheme in scenarios where it works for them and 
> not use it in other scenarios.

So my main point is that the server-side expand/compress is a non-starter of an idea, because it violates basic Subversion requirements, whereas client-side is a viable option for some use cases.

- Julian

Re: FSFS format7 and compressed XML bundles

Posted by Julian Foad <ju...@btopenworld.com>.
Vincent Lefevre wrote:

> On 2013-03-05 13:30:28 +0000, Julian Foad wrote:
>>  Vincent Lefevre wrote:
>>  > On 2013-03-01 14:58:10 +0000, Philip Martin wrote:
>>  >>  A server-side solution is difficult.  Suppose the client has some
>>  >>  uncompressed content U which it compresses to C and sends to the server.
>>  >>  The server can uncompress C to get U but unless the compression scheme
>>  >>  has a canonical compressed form, with no other forms allowed, the server
>>  >>  cannot avoid storing C because there is no guarantee that C can be
>>  >>  reconstructed from U.
>>  > 
>>  > This is not specific to server side. Even on the client side, the
>>  > reconstruction may not be always possible, e.g. if the system is
>>  > upgraded or if NFS is used. And the compression level may need to
>>  > be detected or provided in some way.
>> 
>>  Hi Vincent.  I'm not sure you understood Philip's point.
> 
> This should be more clear about what I meant below. What I'm saying is
> that whether this is done entirely on the server side (a bad solution,
> IMHO) or on the client side (see below why), the problems are similar.

The point Philip made is *not* a problem if done client-side; some of the *other* problems are similar no matter on which side we would do the expansion/compression.

>>  His point is (correct me if I'm wrong) that Subversion's design
>>  requires that during a checkout or update, the server must
>>  reconstruct a file containing exactly the same bit pattern that the
>>  client sent when committing the file.  Compression schemes in
>>  general don't guarantee that expanding and then compressing will
>>  produce the same compressed bit pattern, even if you take care to
>>  use the same "compression level".  Therefore, the server cannot
>>  simply expand the data before storing it and then re-compress it
>>  during checkout or update, because, although the resulting
>>  compressed file would be a valid representation of the user's data,
>>  it would not satisfy Subversion's own requirement that the bit
>>  pattern be identical to what was sent by the client during the
>>  commit.
> 
> You say that the server expands the data before storing it. This is
> for a server-side only solution, I assume.

Yes, I'm talking about the server-side-only solution, which is one of the hypothetical solutions that we are discussing and comparing.

> But even if there would
> be no problems with the construction/reconstruction, it would be a
> bad solution, IMHO. Indeed, for a commit, it is the client that is
> supposed to expand the data before sending the diff to the server,

What do you mean "the client [...] is supposed to expand the data"?  I don't understand why you think the client is "supposed" to do such a thing.

> and for an update, it is the client that is supposed to recompress
> the data before storing it to the WC. Actually, the server doesn't
> need to know how the file was compressed, it just needs to record
> information about the compression (but doesn't need to know what
> this means exactly).
> 
>>  That point _is_ specific to a server-side solution.  With a
>>  client-side solution, the user's word processor may not mind if a
>>  versioning operation such as a commit (through a decompressing
>>  plug-in) followed by checkout (through a re-compressing plug-in)
>>  changes the bit pattern of the compressed file, so long as the
>>  uncompressed content that it represents is unchanged.
> 
> I disagree.

It's not clear what you disagree with.

> The word processor may not mind (in theory, because
> in practice, one may have bugs that depend on the bit pattern,
> and it would be bad to expose the user to such kind of bugs and
> non-deterministic behavior), but for the user this may be important.
> For instance, a different bit pattern will break a possible signature
> on the compressed file.

I agree that it *may* be important for the user, but the users have control so they can use this client-side scheme in scenarios where it works for them and not use it in other scenarios.

- Julian

Re: FSFS format7 and compressed XML bundles

Posted by Vincent Lefevre <vi...@vinc17.net>.
Hi Julian,

On 2013-03-05 13:30:28 +0000, Julian Foad wrote:
> Vincent Lefevre wrote:
> 
> > On 2013-03-01 14:58:10 +0000, Philip Martin wrote:
> >>  A server-side solution is difficult.  Suppose the client has some
> >>  uncompressed content U which it compresses to C and sends to the server.
> >>  The server can uncompress C to get U but unless the compression scheme
> >>  has a canonical compressed form, with no other forms allowed, the server
> >>  cannot avoid storing C because there is no guarantee that C can be
> >>  reconstructed from U.
> > 
> > This is not specific to server side. Even on the client side, the
> > reconstruction may not be always possible, e.g. if the system is
> > upgraded or if NFS is used. And the compression level may need to
> > be detected or provided in some way.
> 
> Hi Vincent.  I'm not sure you understood Philip's point.

This should be more clear about what I meant below. What I'm saying is
that whether this is done entirely on the server side (a bad solution,
IMHO) or on the client side (see below why), the problems are similar.

> His point is (correct me if I'm wrong) that Subversion's design
> requires that during a checkout or update, the server must
> reconstruct a file containing exactly the same bit pattern that the
> client sent when committing the file.  Compression schemes in
> general don't guarantee that expanding and then compressing will
> produce the same compressed bit pattern, even if you take care to
> use the same "compression level".  Therefore, the server cannot
> simply expand the data before storing it and then re-compress it
> during checkout or update, because, although the resulting
> compressed file would be a valid representation of the user's data,
> it would not satisfy Subversion's own requirement that the bit
> pattern be identical to what was sent by the client during the
> commit.

You say that the server expands the data before storing it. This is
for a server-side only solution, I assume. But even if there would
be no problems with the construction/reconstruction, it would be a
bad solution, IMHO. Indeed, for a commit, it is the client that is
supposed to expand the data before sending the diff to the server,
and for an update, it is the client that is supposed to recompress
the data before storing it to the WC. Actually, the server doesn't
need to know how the file was compressed, it just needs to record
information about the compression (but doesn't need to know what
this means exactly).

> That point _is_ specific to a server-side solution.  With a
> client-side solution, the user's word processor may not mind if a
> versioning operation such as a commit (through a decompressing
> plug-in) followed by checkout (through a re-compressing plug-in)
> changes the bit pattern of the compressed file, so long as the
> uncompressed content that it represents is unchanged.

I disagree. The word processor may not mind (in theory, because
in practice, one may have bugs that depend on the bit pattern,
and it would be bad to expose the user to such kind of bugs and
non-deterministic behavior), but for the user this may be important.
For instance, a different bit pattern will break a possible signature
on the compressed file.

-- 
Vincent Lefèvre <vi...@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Re: FSFS format7 and compressed XML bundles

Posted by Julian Foad <ju...@btopenworld.com>.
Vincent Lefevre wrote:

> On 2013-03-01 14:58:10 +0000, Philip Martin wrote:
>>  A server-side solution is difficult.  Suppose the client has some
>>  uncompressed content U which it compresses to C and sends to the server.
>>  The server can uncompress C to get U but unless the compression scheme
>>  has a canonical compressed form, with no other forms allowed, the server
>>  cannot avoid storing C because there is no guarantee that C can be
>>  reconstructed from U.
> 
> This is not specific to server side. Even on the client side, the
> reconstruction may not be always possible, e.g. if the system is
> upgraded or if NFS is used. And the compression level may need to
> be detected or provided in some way.

Hi Vincent.  I'm not sure you understood Philip's point.  His point is (correct me if I'm wrong) that Subversion's design requires that during a checkout or update, the server must reconstruct a file containing exactly the same bit pattern that the client sent when committing the file.  Compression schemes in general don't guarantee that expanding and then compressing will produce the same compressed bit pattern, even if you take care to use the same "compression level".  Therefore, the server cannot simply expand the data before storing it and then re-compress it during checkout or update, because, although the resulting compressed file would be a valid representation of the user's data, it would not satisfy Subversion's own requirement that the bit pattern be identical to what was sent by the client during the commit.

That point _is_ specific to a server-side solution.  With a client-side solution, the user's word processor may not mind if a versioning operation such as a commit (through a decompressing plug-in) followed by checkout (through a re-compressing plug-in) changes the bit pattern of the compressed file, so long as the uncompressed content that it represents is unchanged.

Of course other things can always go wrong, such as not having the right software installed (is that what you mean by "the system is upgraded"?), but that's true of all computing tasks, not specific to this one.  I don't know what you mean about using NFS.

- Julian

Re: FSFS format7 and compressed XML bundles

Posted by Vincent Lefevre <vi...@vinc17.net>.
On 2013-03-01 14:58:10 +0000, Philip Martin wrote:
> A server-side solution is difficult.  Suppose the client has some
> uncompressed content U which it compresses to C and sends to the server.
> The server can uncompress C to get U but unless the compression scheme
> has a canonical compressed form, with no other forms allowed, the server
> cannot avoid storing C because there is no guarantee that C can be
> reconstructed from U.

This is not specific to server side. Even on the client side, the
reconstruction may not be always possible, e.g. if the system is
upgraded or if NFS is used. And the compression level may need to
be detected or provided in some way.

-- 
Vincent Lefèvre <vi...@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Re: FSFS format7 and compressed XML bundles

Posted by Philip Martin <ph...@wandisco.com>.
Julian Foad <ju...@btopenworld.com> writes:

> Yes, a client-side plug-in -- either to Subversion or to OpenOffice --
> seems to me the best practical solution.

A server-side solution is difficult.  Suppose the client has some
uncompressed content U which it compresses to C and sends to the server.
The server can uncompress C to get U but unless the compression scheme
has a canonical compressed form, with no other forms allowed, the server
cannot avoid storing C because there is no guarantee that C can be
reconstructed from U.
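
A small illustration of the "no canonical form" point (a sketch, not
Subversion code): the same content U has several valid compressed encodings,
so U alone does not determine which C the client sent.

  import zlib

  u = b"\n".join(b"some mildly repetitive line %d" % i for i in range(5000))

  # Two compression levels stand in for whatever settings the client used.
  c_fast = zlib.compress(u, 1)
  c_best = zlib.compress(u, 9)

  assert zlib.decompress(c_fast) == u and zlib.decompress(c_best) == u
  # Both are correct encodings of U, yet (for any non-trivial input like this)
  # the byte streams differ, so the server cannot regenerate "the" C from U.
  print(len(c_fast), len(c_best), c_fast == c_best)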

-- 
Certified & Supported Apache Subversion Downloads:
http://www.wandisco.com/subversion/download

Re: FSFS format7 and compressed XML bundles

Posted by Julian Foad <ju...@btopenworld.com>.
Ben Reser wrote:

> Speaking with Julian here at ApacheCon he mentioned that gzip has a
> rsyncable option.  Looking into this turns out that there is a patch
> applied to Debian's gzip that provides this option.  It resets the
> compression algorithm every 1000 bytes and thus makes blocks that can

Use of such a zip format would be ideal -- Subversion's binary-delta would then calculate an excellent delta as long as each inserted chunk is smaller than the delta window size (currently 100 KB, Stefan's proposal 1 MB).

I'm not sure about the details of how the restartable compression works, but it somehow selects points in the uncompressed data that don't depend on the absolute byte offset from the start of the file, and resets the compression at those points.

As I understand it, only the compressor needs the special logic, and the resulting compressed file is still in the same format and fully compatible with the standard decompression libraries.

But unfortunately, although patches for this "restartable" or "rsyncable" mode of compression have been around for years, and it can have a very low overhead, it doesn't yet seem to have been implemented in the common compression libraries (such as zlib), and OpenOffice doesn't offer that mode.

Therefore this is not a practical solution at the moment.

> be saved between revisions of the file.  gzip uses the same DEFLATE
> algorithm that most zip files use, so the same idea could be applied
> to it.  If we want to deal with something like this in Subversion, I
> think we'd deal with it via some sort of plugin for specific file
> types that could convert to the more efficient to deltify encoding
> before committing.  Unfortunately, we don't have any sort of plugin
> type infrastructure for this today.

Yes, a client-side plug-in -- either to Subversion or to OpenOffice -- seems to me the best practical solution.

There exists a plug-in to OpenOffice, "OOoSVN", which, when you want to commit the current version of the doc that you are editing, uncompresses the doc file into a tree of files in its own private svn working copy (that it creates in your home directory) and commits that.  Similarly, to update your doc to an old version, or to retrieve two versions and diff them, it updates this hidden WC and then compresses the files in the WC into a ".odt" or whatever, and lets OpenOffice load or diff that file.

I have tried "OOoSVN" and it works but it is very crude -- the user interface is poor and it is not flexible -- it only supports a local dedicated svn repository, for example.

> Even still there are things that can be done today.  I made a couple
> trivial Microsoft Office Word documents.  One with the characters
> "abc" in them and one with "abcdef" in it.  I saved the 
> files in .docx
> and in the 2003 flat XML format.  The .docx file produced a delta of
> 3262 bytes, the .xml format produced a file with a delta of just 358
> bytes.
> 
> OpenOffice/LibreOffice support flat versions of their format (e.g.
> .fodt) that are not compressed and can also be more efficiently stored
> in Subversion.  LibreOffice even has a wiki about this:
> https://wiki.documentfoundation.org/Libreoffice_and_subversion

We should talk to the OpenOffice folks and see if we can convince them of the value of using a restartable compression by default, and find out how possible that is.  It would be great if that Wiki page could even say, "We'd like to use restartable compression for this reason but we need the compression library developers to make it available."

But for a practical solution until restartable compression becomes the norm (if it ever does), if you (Magnus) would like to help by designing some kind of solution, that would be great.  Please do keep discussing it here if you have any thoughts in this direction.  FWIW I think it's an important and interesting issue.

- Julian

Re: FSFS format7 and compressed XML bundles

Posted by Ben Reser <be...@reser.org>.
On Thu, Feb 28, 2013 at 8:37 AM, Ben Reser <be...@reser.org> wrote:
> I just don't see this happening unless someone has a very clever idea
> that I haven't thought of.

Speaking with Julian here at ApacheCon he mentioned that gzip has a
rsyncable option.  Looking into this turns out that there is a patch
applied to Debian's gzip that provides this option.  It resets the
compression algorithm every 1000 bytes and thus makes blocks that can
be saved between revisions of the file.  gzip uses the same DEFLATE
algorithm that most zip files use, so the same idea could be applied
to it.  If we want to deal with something like this in Subversion, I
think we'd deal with it via some sort of plugin for specific file
types that could convert to the more efficient to deltify encoding
before committing.  Unfortunately, we don't have any sort of plugin
type infrastructure for this today.

Even still there are things that can be done today.  I made a couple
trivial Microsoft Office Word documents.  One with the characters
"abc" in them and one with "abcdef" in it.  I saved the files in .docx
and in the 2003 flat XML format.  The .docx file produced a delta of
3262 bytes, the .xml format produced a file with a delta of just 358
bytes.

OpenOffice/LibreOffice support flat versions of their format (e.g.
.fodt) that are not compressed and can also be more efficiently stored
in Subversion.  LibreOffice even has a wiki about this:
https://wiki.documentfoundation.org/Libreoffice_and_subversion

Re: FSFS format7 and compressed XML bundles

Posted by Vincent Lefevre <vi...@vinc17.net>.
On 2013-02-28 08:37:51 -0800, Ben Reser wrote:
> 3) You'd be saving storage at the expense of using time (read: CPU) on
> every client that's working with those files when checking out.  So
> the end result may be worse than the current problem.

But storage is permanent, while a checkout may occur very rarely.
Of course, the decompress/recompress solution should be a user
choice because this may be bad in some cases, but it can also be
very useful.

-- 
Vincent Lefèvre <vi...@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Re: FSFS format7 and compressed XML bundles

Posted by Ben Reser <be...@reser.org>.
On Thu, Feb 28, 2013 at 8:04 AM, Magnus Thor Torfason
<zu...@gmail.com> wrote:
> I've been following the discussion about FSFS format7, and had a question:
> Is there any chance that the format would improve storage efficiency for
> documents that are stored as compressed (zipped) bundles of XML files and
> other resource files (Read MS Office Documents, but OpenOffice is similar).
>
> I'm finding that making very small changes in big documents (with embedded
> images) results in rapid growth of the repository, since the binary diff
> algorithm seems to not be able to figure out efficient deltas for this type
> of documents, even though analysis of the contents shows that they are
> almost unchanged.

I don't think it's in the plan at this point.  The question I have is
how would you imagine that SVN should efficiently store these files?
From the file system layer I don't think there is a good solution to
this problem.  Since the only way I can see you having efficient
storage is to start manipulating the files e.g. decompressing them for
storage in the repository.  Our file system layer should never start
manipulating the content it's storing.

The only solution I see to this problem and frankly I don't think it's
one we're likely to implement is a client side special handling of
certain mime-types.  Similar to how we do end of line normalization
based on a property, we could decompress these files for storage in
the repo and then re-compress them at the client side.

That said let me explain why I think we'd not be likely to implement this.

1) This would require special handling of certain file formats,
something I don't think we should get into.
2) We might have the dependencies to decompress some formats, but once
we go down this road we'd likely need to pull in more and more exotic
libraries or we'd have to tell people no we won't support this one
format.
3) You'd be saving storage at the expense of using time (read: CPU) on
every client that's working with those files when checking out.  So
the end result may be worse than the current problem.

I just don't see this happening unless someone has a very clever idea
that I haven't thought of.

Re: FSFS format7 and compressed XML bundles

Posted by Ben Reser <be...@reser.org>.
On Thu, Feb 28, 2013 at 12:08 PM, Stefan Fuhrmann
<st...@wandisco.com> wrote:
> ZIP - in contrast to .tar.gz - compresses each of these files
> individually and then mainly concatenates them into the
> result file. As long as you don't change the template or
> any of the existing pictures, for instance, larger parts of
> the file should remain unchanged. PowerPoint presentations
> are probably the ones that benefit most from this scheme.
>
> Format7 will (hopefully) be able to deal with a few 100kB of
> inserted / removed data and still find all matching regions.
> This is exactly what we expect from office files: changes
> should affect some of the opaque data blocks but leave
> other ones alone.

In that case you're right it would help.  But it'll only help on the
file level.  I'm not entirely convinced that changes can ever be
limited to just one or a few files with the Office formats.  I'm sure
there are cases where this does happen, but the question is whether it
happens in enough cases to result in any additional efficiency in
practice.  Obviously, your changes are useful for other reasons than
just Office files.  So I guess we'll see what happens when this is
used in practice.

Re: FSFS format7 and compressed XML bundles

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Thu, Feb 28, 2013 at 5:04 PM, Magnus Thor Torfason <zulutime.net@gmail.com> wrote:

> Hey all,
>

Sorry that I have to disagree with what most people said.
I guess, Mark got closed to the what the current intend is.

> I've been following the discussion about FSFS format7, and had a question:
> Is there any chance that the format would improve storage efficiency for
> documents that are stored as compressed (zipped) bundles of XML files and
> other resource files (Read MS Office Documents, but OpenOffice is similar).
>

Yes, exactly that: There is a *chance* that those will be
stored more efficiently. The thing about this format is that
they are ZIP-compressed file trees, with each file being
something like an embedded picture, the main text body,
the template etc.

ZIP - in contrast to .tar.gz - compresses each of these files
individually and then mainly concatenates them into the
result file. As long as you don't change the template or
any of the existing pictures, for instance, larger parts of
the file should remain unchanged. PowerPoint presentations
are probably the ones that benefit most from this scheme.

Format7 will (hopefully) be able to deal with a few 100kB of
inserted / removed data and still find all matching regions.
This is exactly what we expect from office files: changes
should affect some of the opaque data blocks but leave
other ones alone.
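
A quick way to see that effect (a sketch with made-up member names and
contents, not tied to any real office file): write two versions of a zipped
"document" in which only the XML member changes, and check that the
compressed byte run of the untouched member is identical in both archives,
merely sitting at a different offset -- exactly the kind of shifted match a
larger / sliding delta window can pick up:

  import io, struct, zipfile

  def make_zip(members):
      out = io.BytesIO()
      with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
          for name, data in members:
              zf.writestr(name, data)
      return out.getvalue()

  def raw_member(buf, name):
      """Raw (still compressed) byte run of one member inside a ZIP archive."""
      info = zipfile.ZipFile(io.BytesIO(buf)).getinfo(name)
      off = info.header_offset
      n_name, n_extra = struct.unpack("<HH", buf[off + 26:off + 30])
      start = off + 30 + n_name + n_extra
      return buf[start:start + info.compress_size]

  xml_v1 = b"<w:p>some paragraph</w:p>" * 4000
  xml_v2 = b"<w:p>one edited paragraph</w:p>" + xml_v1     # small change
  image = bytes(range(256)) * 2000                          # untouched "picture"

  v1 = make_zip([("document.xml", xml_v1), ("media/image1.bin", image)])
  v2 = make_zip([("document.xml", xml_v2), ("media/image1.bin", image)])

  # The unchanged member's deflate stream is byte-for-byte the same in both
  # archives; only its position moved, because the XML member ahead of it grew.
  assert raw_member(v1, "media/image1.bin") == raw_member(v2, "media/image1.bin")
  print(v1.find(raw_member(v1, "media/image1.bin")),
        v2.find(raw_member(v2, "media/image1.bin")))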

> I'm finding that making very small changes in big documents (with embedded
> images) results in rapid growth of the repository, since the binary diff
> algorithm seems to not be able to figure out efficient deltas for this type
> of documents, even though analysis of the contents shows that they are
> almost unchanged.
>

In line with what others already said about this: there will be
no format-specific delta algorithms. This would make SVN
susceptible to attacks by manipulated user data (think of
all the security issues that stem from invalid pictures or
zip files).

The furthest that we might go (not planned, though) is to
have a set of alternative generic compression strategies
plus an equally generic way to choose the most suitable
one among them. Again, that is not planned for format7.


> This may be outside the scope of format7, but I thought I'd ask the
> question nevertheless.
>

No, it's right on the spot. But there will only be general
algorithmic improvements that "happen" to help in your
case.

There is another idea that I had concerning efficient
storage of office files: Templates and corporate ID data
should result in long, identical sub-sections that can be
found in many files. We might be able to identify these
common blocks and store them only once. So far, I
haven't tagged this idea with a target version.

-- Stefan^2.

-- 
Certified & Supported Apache Subversion Downloads:
*

http://www.wandisco.com/subversion/download
*

FSFS format7 and compressed XML bundles

Posted by Magnus Thor Torfason <zu...@gmail.com>.
Hey all,

I've been following the discussion about FSFS format7, and had a 
question: Is there any chance that the format would improve storage 
efficiency for documents that are stored as compressed (zipped) bundles 
of XML files and other resource files (read: MS Office documents, but 
OpenOffice is similar).

I'm finding that making very small changes in big documents (with 
embedded images) results in rapid growth of the repository, since the 
binary diff algorithm seems unable to figure out efficient 
deltas for this type of document, even though analysis of the contents 
shows that they are almost unchanged.

This may be outside the scope of format7, but I thought I'd ask the 
question nevertheless.

Best,
Magnus

Re: FSFS format7 status and first results

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Mon, Feb 18, 2013 at 1:46 PM, Daniel Shahaf <da...@elego.de> wrote:

> Stefan Fuhrmann wrote on Sat, Feb 16, 2013 at 10:52:21 +0100:
> > Hey all,
> >
> > Just to give you an update on what is going on that branch,
> > here a few facts and numbers.  Bottom line is that there is
> > still a lot to do but the basic assumptions proved correct and
> > significant benefits can already be demonstrated.
>
> Thanks for the update.  Sounds good.  How much has the internal code
> structure changed?  (apart from splitting fs_fs.c)  I'm trying to get
> a handle on the maintainability / bus factor of the new code.
>

Existing code did not change very much (guess: <100 lines
touched). The "diverging" parts can be found in cached_data.c
and transaction.c; pack.c simply selects two disjoint code
paths. The refactoring has already paid off: for example, there
were only a few places, all in one file, that I had to hook
to implement a complete data access log.

For format7, however, a lot of new / extra code kicks in and
ultimately we may see a 50 .. 100% increase in FSFS code
size. As soon as 1.8 gets released, people may have spare
cycles to discuss design and review code (now that we know
format7 basically works and what should be done to make
it even better).

-- Stefan^2.

-- 
Certified & Supported Apache Subversion Downloads:
*

http://www.wandisco.com/subversion/download
*

Re: FSFS format7 status and first results

Posted by Daniel Shahaf <da...@elego.de>.
Stefan Fuhrmann wrote on Sat, Feb 16, 2013 at 10:52:21 +0100:
> Hey all,
> 
> Just to give you an update on what is going on that branch,
> here a few facts and numbers.  Bottom line is that there is
> still a lot to do but the basic assumptions proved correct and
> significant benefits can already be demonstrated.

Thanks for the update.  Sounds good.  How much has the internal code
structure changed?  (apart from splitting fs_fs.c)  I'm trying to get
a handle on the maintainability / bus factor of the new code.

Re: FSFS format7 status and first results

Posted by Branko Čibej <br...@wandisco.com>.
On 16.02.2013 22:30, Stefan Fuhrmann wrote:
> '--enable-optimize' is new in 1.8. It should probably be documented
> somewhere but I'm not sure how safe it is to *recommend* it to
> packagers. The optimizations are quite aggressive and might break
> unclean code.

It's about as documented as anything else in our configury. :)

It's mainly intended to allow optimizations when generating debug info,
so --enable-debug --enable-optimize will result in -O1 instead of -O0.

If you use neither of these options, configure's defaults are not affected.
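
To spell out that mapping (this is just my reading of the above; treat the
exact -O levels as an assumption, not a guarantee):

  ./configure --enable-debug                     # debug info, -O0
  ./configure --enable-debug --enable-optimize   # debug info, but -O1
  ./configure                                    # defaults, unaffected
  ./configure --enable-optimize                  # aggressive optimization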

-- Brane


-- 
Branko Čibej
Director of Subversion | WANdisco | www.wandisco.com


Re: FSFS format7 status and first results

Posted by Stefan Sperling <st...@elego.de>.
On Thu, Feb 21, 2013 at 11:05:56AM +0100, Stefan Fuhrmann wrote:
> On Mon, Feb 18, 2013 at 5:54 PM, Mark Phippard <ma...@gmail.com> wrote:
> > BTW, how are you managing your branch?  I tried merging it back to
> > trunk to get an idea on the diff and there were a lot of text and tree
> > conflicts.  I thought I had seen you doing synch merges from trunk in
> > the past so I assumed it would merge back.
> >
> 
> Hm. I split fsfs.c and .h into multiple files on the
> branch and the first merge from /trunk required
> significant manual intervention. Since that, merges
> have been clean because fsfs.* wasn't touched
> on /trunk.
> 
> If I understand Julian's merge changes in 1.8,
> reintegrating should work because there has been
> no cherry picking etc.

Julian's merge changes don't affect tree conflict handling at all.

Generally, you'll always get tree conflicts when merging between
branches which have different tree structures in parts of the
branches affected by the merge. So until trunk receives your
refactoring, there will be tree conflicts. (We're not even automatically
handling simple renames during merges yet, so I doubt we'll
automatically handle the split-file scenario anytime soon).

Re: FSFS format7 status and first results

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Thu, Feb 21, 2013 at 12:06 PM, Johan Corveleyn <jc...@gmail.com> wrote:

> On Thu, Feb 21, 2013 at 11:05 AM, Stefan Fuhrmann
> <st...@wandisco.com> wrote:
> > On Mon, Feb 18, 2013 at 5:54 PM, Mark Phippard <ma...@gmail.com>
> wrote:
> >> On Sat, Feb 16, 2013 at 4:30 PM, Stefan Fuhrmann
> >> <st...@wandisco.com> wrote:
> ...
> >> > Quite a number of reasons:
> >> >
> >> > * easy setup
> >> > * minimal overhead (I want to get as close to measuring pure
> >> >   FS layer performance as possible)
> >> > * easy to debug and profile
> >>
> >> I get that for development purposes, but I would personally like to
> >> see that the caching etc. is yielding benefits when HTTP is used.
> >
> >
> > Apache should only add constant overhead, i.e. the
> > absolute savings should be roughly the same. Once
> > the cache-server branch is finished, the difference
> > in cache efficiency & effect between svnserve and
> > Apache should be gone.
>
> I guess the question is mainly: how much of the caching benefit will
> be visible to the end-user with mod_dav_svn? Or will it perhaps be
> "hidden" by overhead of HTTPv2 etc ...?
>

Format7 is not so much about caching as about real I/O
reduction. It only requires (better) caching to get the
maximum benefit from the new heuristics.

And yes, Apache will come with some extra overhead
when compared to svnserve. In the BDB vs. FSFS
comparison, even adding Apache and working copy
overhead could not mask the fundamental performance
differences between both implementations. Format7
vs. format6 will show an even larger difference on
the lowest level. So, the client *will* see improvements.


> In the first place in a fast LAN (that might be something you can test
> relatively easily), but secondary also in a WAN ... how much
> performance improvement remains when executing particular operations
> ...
>

One operation that will become dramatically faster
is svn log and hopefully svn log -g (and other merge-
related queries). Today, some setups get < 10 revs/s
for a 'svn log -v ^/' simply due to serial disk I/O. Those
setups can be boosted to 100k revs/s, even with cold caches.
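
For reference, the kind of measurement behind those numbers is the null-log
benchmark quoted earlier in this thread; the URL is a placeholder for your
own repository:

  $ time svnbench null-log svn://localhost/repos/trunk -v -q
  $ time svnbench null-log svn://localhost/repos/trunk -v -g -q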

-- Stefan^2.

-- 
Certified & Supported Apache Subversion Downloads:
*

http://www.wandisco.com/subversion/download
*

Re: FSFS format7 status and first results

Posted by Branko Čibej <br...@wandisco.com>.
On 21.02.2013 12:06, Johan Corveleyn wrote:
> On Thu, Feb 21, 2013 at 11:05 AM, Stefan Fuhrmann
> <st...@wandisco.com> wrote:
>> On Mon, Feb 18, 2013 at 5:54 PM, Mark Phippard <ma...@gmail.com> wrote:
>>> On Sat, Feb 16, 2013 at 4:30 PM, Stefan Fuhrmann
>>> <st...@wandisco.com> wrote:
> ...
>>>> Quite a number of reasons:
>>>>
>>>> * easy setup
>>>> * minimal overhead (I want to get as close to measuring pure
>>>>   FS layer performance as possible)
>>>> * easy to debug and profile
>>> I get that for development purposes, but I would personally like to
>>> see that the caching etc. is yielding benefits when HTTP is used.
>>
>> Apache should only add constant overhead, i.e. the
>> absolute savings should be roughly the same. Once
>> the cache-server branch is finished, the difference
>> in cache efficiency & effect between svnserve and
>> Apache should be gone.
> I guess the question is mainly: how much of the caching benefit will
> be visible to the end-user with mod_dav_svn? Or will it perhaps be
> "hidden" by overhead of HTTPv2 etc ...?
>
> In the first place in a fast LAN (that might be something you can test
> relatively easily), but secondary also in a WAN ... how much
> performance improvement remains when executing particular operations
> ...

This kind of question is IMO too simplistic. The real question is, will
500 simultaneous users see a difference? And I think the only reliable
way to get the answer is to find 500 simultaneous users first. :)

-- Brane

-- 
Branko Čibej
Director of Subversion | WANdisco | www.wandisco.com


Re: FSFS format7 status and first results

Posted by Johan Corveleyn <jc...@gmail.com>.
On Thu, Feb 21, 2013 at 11:05 AM, Stefan Fuhrmann
<st...@wandisco.com> wrote:
> On Mon, Feb 18, 2013 at 5:54 PM, Mark Phippard <ma...@gmail.com> wrote:
>> On Sat, Feb 16, 2013 at 4:30 PM, Stefan Fuhrmann
>> <st...@wandisco.com> wrote:
...
>> > Quite a number of reasons:
>> >
>> > * easy setup
>> > * minimal overhead (I want to get as close to measuring pure
>> >   FS layer performance as possible)
>> > * easy to debug and profile
>>
>> I get that for development purposes, but I would personally like to
>> see that the caching etc. is yielding benefits when HTTP is used.
>
>
> Apache should only add constant overhead, i.e. the
> absolute savings should be roughly the same. Once
> the cache-server branch is finished, the difference
> in cache efficiency & effect between svnserve and
> Apache should be gone.

I guess the question is mainly: how much of the caching benefit will
be visible to the end-user with mod_dav_svn? Or will it perhaps be
"hidden" by overhead of HTTPv2 etc ...?

In the first place in a fast LAN (that might be something you can test
relatively easily), but secondarily also in a WAN ... how much
performance improvement remains when executing particular operations
...

-- 
Johan

Re: Merge conflicts and mergeinfo graph problems with FSFS format7 branch [was: FSFS format7 status and first results]

Posted by Branko Čibej <br...@wandisco.com>.
On 21.02.2013 19:15, Mark Phippard wrote:
> On Thu, Feb 21, 2013 at 1:04 PM, Mark Phippard <ma...@gmail.com> wrote:
>
>>> That graph is wrong or at least misleading.  There have been catch-up merges, for example this one:
>>> I don't yet know what's going wrong, but likely something to do with subtree mergeinfo is causing the mergeinfo
>>> graph to think that was not a complete merge.
>> I assumed the mergeinfo graph was wrong, because I recalled him doing
>> those sort of commits.  What I was getting at, was if the merge graph
>> is wrong, then maybe merge itself is having the same issue and not
>> trying to do a reintegrate merge.
> FWIW, if I specifically add --reintegrate to the merge command it runs
> fine and there is just a single text conflict.  So that must mean that
> all of the reintegrate checks for subtree mergeinfo passed.
>
> It sounds like when I just ran "svn merge" it did not pick the correct
> merge strategy.

That's a great test case for Julian to look at before we branch. :)

-- Brane

-- 
Branko Čibej
Director of Subversion | WANdisco | www.wandisco.com


Re: Merge conflicts and mergeinfo graph problems with FSFS format7 branch [was: FSFS format7 status and first results]

Posted by Mark Phippard <ma...@gmail.com>.
On Thu, Feb 21, 2013 at 1:04 PM, Mark Phippard <ma...@gmail.com> wrote:

>> That graph is wrong or at least misleading.  There have been catch-up merges, for example this one:
>
>> I don't yet know what's going wrong, but likely something to do with subtree mergeinfo is causing the mergeinfo
>> graph to think that was not a complete merge.
>
> I assumed the mergeinfo graph was wrong, because I recalled him doing
> those sort of commits.  What I was getting at, was if the merge graph
> is wrong, then maybe merge itself is having the same issue and not
> trying to do a reintegrate merge.

FWIW, if I specifically add --reintegrate to the merge command it runs
fine and there is just a single text conflict.  So that must mean that
all of the reintegrate checks for subtree mergeinfo passed.
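
For the record, the command was roughly the following, run from a trunk
working copy (reconstructed from memory, not the verbatim command line):

  $ svn merge --reintegrate ^/subversion/branches/fsfs-format7 .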

It sounds like when I just ran "svn merge" it did not pick the correct
merge strategy.

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: Merge conflicts and mergeinfo graph problems with FSFS format7 branch [was: FSFS format7 status and first results]

Posted by Mark Phippard <ma...@gmail.com>.
On Thu, Feb 21, 2013 at 11:54 AM, Julian Foad
<ju...@btopenworld.com> wrote:
> Mark Phippard wrote:
>
>> On Thu, Feb 21, 2013 at 5:05 AM, Stefan Fuhrmann
>> <st...@wandisco.com> wrote:
>>
>>>>  BTW, how are you managing your branch?  I tried merging it back to
>>>>  trunk to get an idea on the diff and there were a lot of text and tree
>>>>  conflicts.  I thought I had seen you doing synch merges from trunk in
>>>>  the past so I assumed it would merge back.
>>>
>>>
>>>  Hm. I split fsfs.c and .h into multiple files on the
>>>  branch and the first merge from /trunk required
>>>  significant manual intervention. Since that, merges
>>>  have been clean because fsfs.* wasn't touched
>>>  on /trunk.
>>>
>>>  If I understand Julian's merge changes in 1.8,
>>>  reintegrating should work because there has been
>>>  no cherry picking etc.
>>
>> I see this when using 1.8:
>>
>> $ svn mergeinfo ^/subversion/branches/fsfs-format7
>>     youngest common ancestor
>>     |         last full merge
>>     |         |        tip of branch
>>     |         |        |         repository path
>>
>>     1414755            1448697
>>     |                  |
>>        --| |------------         subversion/branches/fsfs-format7
>>       /
>>      /
>>   -------| |------------         subversion/trunk
>>                        |
>>                        1447423
>>
>>
>> It seems to imply that it does not think you have ever synched with
>> trunk.  So maybe it is trying to replay every revision from your
>> branch when I merge back and that is why it gets so many conflicts?
>
> That graph is wrong or at least misleading.  There have been catch-up merges, for example this one:

> I don't yet know what's going wrong, but likely something to do with subtree mergeinfo is causing the mergeinfo
> graph to think that was not a complete merge.

I assumed the mergeinfo graph was wrong, because I recalled him doing
those sorts of commits.  What I was getting at was: if the merge graph
is wrong, then maybe merge itself is having the same issue and not
trying to do a reintegrate merge.

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Merge conflicts and mergeinfo graph problems with FSFS format7 branch [was: FSFS format7 status and first results]

Posted by Julian Foad <ju...@btopenworld.com>.
Mark Phippard wrote:

> On Thu, Feb 21, 2013 at 5:05 AM, Stefan Fuhrmann
> <st...@wandisco.com> wrote:
> 
>>>  BTW, how are you managing your branch?  I tried merging it back to
>>>  trunk to get an idea on the diff and there were a lot of text and tree
>>>  conflicts.  I thought I had seen you doing synch merges from trunk in
>>>  the past so I assumed it would merge back.
>> 
>> 
>>  Hm. I split fsfs.c and .h into multiple files on the
>>  branch and the first merge from /trunk required
>>  significant manual intervention. Since that, merges
>>  have been clean because fsfs.* wasn't touched
>>  on /trunk.
>> 
>>  If I understand Julian's merge changes in 1.8,
>>  reintegrating should work because there has been
>>  no cherry picking etc.
> 
> I see this when using 1.8:
> 
> $ svn mergeinfo ^/subversion/branches/fsfs-format7
>     youngest common ancestor
>     |         last full merge
>     |         |        tip of branch
>     |         |        |         repository path
> 
>     1414755            1448697
>     |                  |
>        --| |------------         subversion/branches/fsfs-format7
>       /
>      /
>   -------| |------------         subversion/trunk
>                        |
>                        1447423
> 
> 
> It seems to imply that it does not think you have ever synched with
> trunk.  So maybe it is trying to replay every revision from your
> branch when I merge back and that is why it gets so many conflicts?

That graph is wrong or at least misleading.  There have been catch-up merges, for example this one:

svn log -r1445479
------------------------------------------------------------------------
r1445479 | stefan2 | 2013-02-13 01:37:54 -0500 (Wed, 13 Feb 2013) | 2 lines

On the fsfs-format7: ketchup merge from /trunk up to and including r1445080.
No conflicts to revolve.
------------------------------------------------------------------------
$ svn diff --depth=empty -c1445479 ^/subversion/branches/fsfs-format7 
Index: .
===================================================================
--- .    (revision 1445478)
+++ .    (revision 1445479)

Property changes on: .
___________________________________________________________________
Modified: svn:mergeinfo
   Merged /subversion/trunk:r1442090-1445080

(and lots of other changes were made in that revision, too, as expected).

I don't yet know what's going wrong, but likely something to do with subtree mergeinfo is causing the mergeinfo graph to think that was not a complete merge.

- Julian

Re: FSFS format7 status and first results

Posted by Mark Phippard <ma...@gmail.com>.
On Thu, Feb 21, 2013 at 5:05 AM, Stefan Fuhrmann
<st...@wandisco.com> wrote:

>> BTW, how are you managing your branch?  I tried merging it back to
>> trunk to get an idea on the diff and there were a lot of text and tree
>> conflicts.  I thought I had seen you doing synch merges from trunk in
>> the past so I assumed it would merge back.
>
>
> Hm. I split fsfs.c and .h into multiple files on the
> branch and the first merge from /trunk required
> significant manual intervention. Since that, merges
> have been clean because fsfs.* wasn't touched
> on /trunk.
>
> If I understand Julian's merge changes in 1.8,
> reintegrating should work because there has been
> no cherry picking etc.

I see this when using 1.8:

$ svn mergeinfo ^/subversion/branches/fsfs-format7
    youngest common ancestor
    |         last full merge
    |         |        tip of branch
    |         |        |         repository path

    1414755            1448697
    |                  |
       --| |------------         subversion/branches/fsfs-format7
      /
     /
  -------| |------------         subversion/trunk
                       |
                       1447423


It seems to imply that it does not think you have ever synched with
trunk.  So maybe it is trying to replay every revision from your
branch when I merge back and that is why it gets so many conflicts?


-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: FSFS format7 status and first results

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Mon, Feb 18, 2013 at 5:54 PM, Mark Phippard <ma...@gmail.com> wrote:

> On Sat, Feb 16, 2013 at 4:30 PM, Stefan Fuhrmann
> <st...@wandisco.com> wrote:
> > On Sat, Feb 16, 2013 at 5:47 PM, Mark Phippard <ma...@gmail.com>
> wrote:
> >>
> >> On Sat, Feb 16, 2013 at 4:52 AM, Stefan Fuhrmann
> >> <st...@wandisco.com> wrote:
> >> > Hey all,
> >> >
> >> > Just to give you an update on what is going on that branch,
> >> > here a few facts and numbers.  Bottom line is that there is
> >> > still a lot to do but the basic assumptions proved correct and
> >> > significant benefits can already be demonstrated.
> >> >
> >> > * about 20% of the coding is done so far
> >> > * some core features implemented:
> >> >   logical addressing, reorg upon pack, block read
> >>
> >> What do you mean by pack here?  Is it svnadmin pack?
> >
> >
> > svnadmin pack
> >
> >>
> >> Is that in any way an essential part of the performance boost?
> >
> >
> > Yes. It will places items (noderevs, representations, change lists)
> > next to each other when they will likely be requested shortly
> > after one another. For instance, try to concatenate all elements
> > of a deltification chain.
> >
> >>
> >> Or are your format7 repositories always packed?
> >
> >
> > They are not. Unpacked revisions will see a performance hit from
> > reading the two extra index files per revision and a boost from
> > block-read which will often fetch the whole revision with a single
> > I/O operation.
>
> So is the main difference between format 6 and 7 how the data is
> organized when they are packed?
>

Currently, yes. Plus the ability to read data from
an arbitrary data block: for every position within
a rev / pack file, we now know what data that is
and can read it directly without DAG traversal etc.

Thus, we now try to hit any block in a RAID system
only once. However, there are limitations to our
caching heuristics that will make this hard to achieve
in some scenarios. Further work will address this in
two ways: improve short-term caching hit rates to
quasi 100% and reduce the number of items to cache.

The latter requires further changes to the on-disk
representation of data: we need to bundle items into
larger blocks ("containers"). As a nice side-effect,
we will save another 30 .. 50% of disk space.


> > Quite a number of reasons:
> >
> > * easy setup
> > * minimal overhead (I want to get as close to measuring pure
> >   FS layer performance as possible)
> > * easy to debug and profile
>
> I get that for development purposes, but I would personally like to
> see that the caching etc. is yielding benefits when HTTP is used.
>

Apache should only add constant overhead, i.e. the
absolute savings should be roughly the same. Once
the cache-server branch is finished, the difference
in cache efficiency & effect between svnserve and
Apache should be gone.


>
> > '--enable-optimize' is new in 1.8. It should probably be documented
> > somewhere but I'm not sure how safe it is to *recommend* it to
> > packagers. The optimizations are quite aggressive and might break
> > unclean code.
> >
> > I used it in conjunction with '-march=native' to minimize CPU time
> > vs. I/O time. It saved 3 .. 5% of CPU cycles in my tests.
>
> OK.
>
> BTW, how are you managing your branch?  I tried merging it back to
> trunk to get an idea on the diff and there were a lot of text and tree
> conflicts.  I thought I had seen you doing synch merges from trunk in
> the past so I assumed it would merge back.
>

Hm. I split fsfs.c and .h into multiple files on the
branch and the first merge from /trunk required
significant manual intervention. Since then, merges
have been clean because fsfs.* hasn't been touched
on /trunk.

If I understand Julian's merge changes in 1.8 correctly,
reintegrating should work because there has been
no cherry picking etc.

-- Stefan^2

-- 
Certified & Supported Apache Subversion Downloads:
*

http://www.wandisco.com/subversion/download
*

Re: FSFS format7 status and first results

Posted by Mark Phippard <ma...@gmail.com>.
On Sat, Feb 16, 2013 at 4:30 PM, Stefan Fuhrmann
<st...@wandisco.com> wrote:
> On Sat, Feb 16, 2013 at 5:47 PM, Mark Phippard <ma...@gmail.com> wrote:
>>
>> On Sat, Feb 16, 2013 at 4:52 AM, Stefan Fuhrmann
>> <st...@wandisco.com> wrote:
>> > Hey all,
>> >
>> > Just to give you an update on what is going on that branch,
>> > here a few facts and numbers.  Bottom line is that there is
>> > still a lot to do but the basic assumptions proved correct and
>> > significant benefits can already be demonstrated.
>> >
>> > * about 20% of the coding is done so far
>> > * some core features implemented:
>> >   logical addressing, reorg upon pack, block read
>>
>> What do you mean by pack here?  Is it svnadmin pack?
>
>
> svnadmin pack
>
>>
>> Is that in any way an essential part of the performance boost?
>
>
> Yes. It will places items (noderevs, representations, change lists)
> next to each other when they will likely be requested shortly
> after one another. For instance, try to concatenate all elements
> of a deltification chain.
>
>>
>> Or are your format7 repositories always packed?
>
>
> They are not. Unpacked revisions will see a performance hit from
> reading the two extra index files per revision and a boost from
> block-read which will often fetch the whole revision with a single
> I/O operation.

So is the main difference between format 6 and 7 how the data is
organized when they are packed?

> Quite a number of reasons:
>
> * easy setup
> * minimal overhead (I want to get as close to measuring pure
>   FS layer performance as possible)
> * easy to debug and profile

I get that for development purposes, but I would personally like to
see that the caching etc. is yielding benefits when HTTP is used.


> '--enable-optimize' is new in 1.8. It should probably be documented
> somewhere but I'm not sure how safe it is to *recommend* it to
> packagers. The optimizations are quite aggressive and might break
> unclean code.
>
> I used it in conjunction with '-march=native' to minimize CPU time
> vs. I/O time. It saved 3 .. 5% of CPU cycles in my tests.

OK.

BTW, how are you managing your branch?  I tried merging it back to
trunk to get an idea on the diff and there were a lot of text and tree
conflicts.  I thought I had seen you doing synch merges from trunk in
the past so I assumed it would merge back.

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: FSFS format7 status and first results

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Sat, Feb 16, 2013 at 5:47 PM, Mark Phippard <ma...@gmail.com> wrote:

> On Sat, Feb 16, 2013 at 4:52 AM, Stefan Fuhrmann
> <st...@wandisco.com> wrote:
> > Hey all,
> >
> > Just to give you an update on what is going on that branch,
> > here a few facts and numbers.  Bottom line is that there is
> > still a lot to do but the basic assumptions proved correct and
> > significant benefits can already be demonstrated.
> >
> > * about 20% of the coding is done so far
> > * some core features implemented:
> >   logical addressing, reorg upon pack, block read
>
> What do you mean by pack here?  Is it svnadmin pack?


svnadmin pack


> Is that in any way an essential part of the performance boost?


Yes. It will place items (noderevs, representations, change lists)
next to each other when they are likely to be requested shortly
after one another. For instance, it will try to concatenate all
elements of a deltification chain.
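
For anyone following along: packing is the existing admin operation, so
nothing new is needed on the command line. A minimal sketch, with a
placeholder repository path:

  $ svnadmin pack /path/to/repos
  $ svnadmin verify -q /path/to/repos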


> Or are your format7 repositories always packed?
>

They are not. Unpacked revisions will see a performance hit from
reading the two extra index files per revision and a boost from
block-read which will often fetch the whole revision with a single
I/O operation.

> > * format 7 repos are ~3x faster due to reduced I/O
> > * format 6 repos get faster by ~10%
>
> When you talk about format6 and branch6 and compare to trunk, what do
> you mean?  Is that just how the format6 repository, which is already
> in trunk, will fare with improved caching that is in your branch?
>

Yes. Improved caching and optimized reporting order (i.e. the order
in which an export / checkout reads the tree) apply to existing format6
repos as well.


> > * format 7 performance hit by cache inefficiency;
> >   will be addressed by 'container' objects plus a
> >   change in the membuffer implementation
> > * on-disk format still in the flux and will remain
> >   so for a while
> >
> > Tests were run against a local copy of the TSVN repo with
> > revprop compression, directory deltification and property
> > deltification enabled to bring format 6 structurally as
> > close to format 7 defaults as possible.
> >
> > All values are given in user-time triples for trunk@1446787
> > on format 6, fsfs-format7@1446862 on format 6 and format 7.
> > "hot" runs mean "from OS cache"; svnserve was restarted
> > before every test run.
> >
> > $ time svnadmin verify -M 4000 -q $repo
> >
> > medium      trunk / branch6 / branch7
> > USB cold   270.5s / 246.3s / 104.4s   = 1.00 : 1.10 : 2.59
> > USB hot    248.0s / 191.0s /  60.8s   = 1.00 : 1.30 : 4.08
> > SSD cold    66.1s /  62.0s /  59.4s   = 1.00 : 1.07 : 1.11
> > SSD hot     63.2s /  60.0s /  57.1s   = 1.00 : 1.05 : 1.11
> >
> > $ time svnbench null-export svn://localhost/tsvn/trunk -q
>
> Any reason you test with svn:// and not http://.  I feel like the
> latter is the most widely used server by a wide margin.
>

Quite a number of reasons:

* easy setup
* minimal overhead (I want to get as close to measuring pure
  FS layer performance as possible)
* easy to debug and profile


>  > USB cold   44.29s / 39.63s / 16.24s   = 1.00 : 1.12 : 2.73
> > USB hot    10.68s / 10.25s /  3.78s   = 1.00 : 1.04 : 2.83
> > SSD cold    5.72s /  5.00s /  3.75s   = 1.00 : 1.14 : 1.53
> > SSD hot     2.37s /  2.38s /  3.21s   = 1.00 : 1.00 : 0.74
> >
> > $ time svnbench null-log svn://localhost/tsvn/trunk -v -q
> >
> > USB cold   54.36s / 50.17s /  8.73s   = 1.00 : 1.06 : 6.11
> > USB hot    43.64s / 36.46s /  3.52s   = 1.00 : 1.20 :12.40
> > SSD cold    9.32s / 10.60s /  3.22s   = 1.00 : 0.88 : 2.89
> > SSD hot     2.36s /  2.28s /  2.88s   = 1.00 : 1.04 : 0.82
> >
> > $ time svnbench null-log svn://localhost/tsvn/trunk -v -g -q
> >
> > USB cold   98.02s / 87.01s / 23.74s   = 1.00 : 1.13 : 4.13
> > USB hot    69.88s / 57.14s /  7.88s   = 1.00 : 1.22 : 8.87
> > SSD cold    8.35s / 10.50s /  8.16s   = 1.00 : 0.80 : 1.02
> > SSD hot     5.94s /  5.72s /  6.39s   = 1.00 : 1.04 : 0.93
> >
> > Tests have been conducted with maximum optimization:
> >
> >   ./configure --disable-shared --disable-debug --enable-optimize \
> >   --without-berkeley-db -without-serf CUSERFLAGS='-march=native'
>
> Do we have, or will we have, a document or wiki that suggest
> optimization flags for packagers?
>

'--enable-optimize' is new in 1.8. It should probably be documented
somewhere but I'm not sure how safe it is to *recommend* it to
packagers. The optimizations are quite aggressive and might break
unclean code.

I used it in conjunction with '-march=native' to minimize CPU time
vs. I/O time. It saved 3 .. 5% of CPU cycles in my tests.
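
Put together, the build used for these measurements looks roughly like this
(same flags as in the configure line quoted above; adjust the remaining
options for your own environment):

  ./configure --disable-shared --disable-debug --enable-optimize \
      CUSERFLAGS='-march=native'
  make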

-- Stefan^2.

-- 
Certified & Supported Apache Subversion Downloads:
*

http://www.wandisco.com/subversion/download
*

Re: FSFS format7 status and first results

Posted by Mark Phippard <ma...@gmail.com>.
On Sat, Feb 16, 2013 at 4:52 AM, Stefan Fuhrmann
<st...@wandisco.com> wrote:
> Hey all,
>
> Just to give you an update on what is going on that branch,
> here a few facts and numbers.  Bottom line is that there is
> still a lot to do but the basic assumptions proved correct and
> significant benefits can already be demonstrated.
>
> * about 20% of the coding is done so far
> * some core features implemented:
>   logical addressing, reorg upon pack, block read

What do you mean by pack here?  Is it svnadmin pack?  Is that in any
way an essential part of the performance boost?  Or are your format7
repositories always packed?

> * format 7 repos are ~3x faster due to reduced I/O
> * format 6 repos get faster by ~10%

When you talk about format6 and branch6 and compare to trunk, what do
you mean?  Is that just how the format6 repository, which is already
in trunk, will fare with improved caching that is in your branch?


> * format 7 performance hit by cache inefficiency;
>   will be addressed by 'container' objects plus a
>   change in the membuffer implementation
> * on-disk format still in the flux and will remain
>   so for a while
>
> Tests were run against a local copy of the TSVN repo with
> revprop compression, directory deltification and property
> deltification enabled to bring format 6 structurally as
> close to format 7 defaults as possible.
>
> All values are given in user-time triples for trunk@1446787
> on format 6, fsfs-format7@1446862 on format 6 and format 7.
> "hot" runs mean "from OS cache"; svnserve was restarted
> before every test run.
>
> $ time svnadmin verify -M 4000 -q $repo
>
> medium      trunk / branch6 / branch7
> USB cold   270.5s / 246.3s / 104.4s   = 1.00 : 1.10 : 2.59
> USB hot    248.0s / 191.0s /  60.8s   = 1.00 : 1.30 : 4.08
> SSD cold    66.1s /  62.0s /  59.4s   = 1.00 : 1.07 : 1.11
> SSD hot     63.2s /  60.0s /  57.1s   = 1.00 : 1.05 : 1.11
>
> $ time svnbench null-export svn://localhost/tsvn/trunk -q

Any reason you test with svn:// and not http://?  I feel like the
latter is the most widely used server by a wide margin.

> USB cold   44.29s / 39.63s / 16.24s   = 1.00 : 1.12 : 2.73
> USB hot    10.68s / 10.25s /  3.78s   = 1.00 : 1.04 : 2.83
> SSD cold    5.72s /  5.00s /  3.75s   = 1.00 : 1.14 : 1.53
> SSD hot     2.37s /  2.38s /  3.21s   = 1.00 : 1.00 : 0.74
>
> $ time svnbench null-log svn://localhost/tsvn/trunk -v -q
>
> USB cold   54.36s / 50.17s /  8.73s   = 1.00 : 1.06 : 6.11
> USB hot    43.64s / 36.46s /  3.52s   = 1.00 : 1.20 :12.40
> SSD cold    9.32s / 10.60s /  3.22s   = 1.00 : 0.88 : 2.89
> SSD hot     2.36s /  2.28s /  2.88s   = 1.00 : 1.04 : 0.82
>
> $ time svnbench null-log svn://localhost/tsvn/trunk -v -g -q
>
> USB cold   98.02s / 87.01s / 23.74s   = 1.00 : 1.13 : 4.13
> USB hot    69.88s / 57.14s /  7.88s   = 1.00 : 1.22 : 8.87
> SSD cold    8.35s / 10.50s /  8.16s   = 1.00 : 0.80 : 1.02
> SSD hot     5.94s /  5.72s /  6.39s   = 1.00 : 1.04 : 0.93
>
> Tests have been conducted with maximum optimization:
>
>   ./configure --disable-shared --disable-debug --enable-optimize \
>   --without-berkeley-db -without-serf CUSERFLAGS='-march=native'

Do we have, or will we have, a document or wiki that suggest
optimization flags for packagers?

Thanks for sharing.  The results look nice.


Mark Phippard
http://markphip.blogspot.com/