Posted to users@subversion.apache.org by "B. Smith-Mannschott" <be...@gmail.com> on 2008/03/08 14:01:12 UTC
Re: Evaluating SVN as a Document Management Solution
On Mar 8, 2008, at 00:50, Tom Blough wrote:
>>> typically in MS Office applications, deliverables are typically
>>> drawings in AutoCAD or Microstation, and database content is
>>> typically financial data from which reports are generated.
>
> For your application, your repository will be huge. All of the file
> types you mention are binary. Therefore, SVN cannot calculate a diff
> on the file and will end up storing a copy of the complete file.
This is incorrect. You're probably thinking of CVS or some similarly
brain-damaged revision control system. SVN uses compressed binary
differences between versions for storage in its repository.
This works well for text of course. It also works well for binary
formats which don't themselves use compression, such as Microsoft
Word's DOC, uncompressed TIFF, ...
> There was a recent thread concerning using XML data formats for newer
> versions of Office in order to save diff content, but that can cause
> problems due to the fact that XML is not order specific. Office can,
> and does, generate different XML for the same document.
Well, yes, that will tend to make your differences larger than they
have to be. The real problem however is that most of these "XML"
formats are not, in fact, XML but rather XML compressed within a ZIP
archive.
Where Subversion's binary differencing and compression fail is on file
formats that are themselves compressed (OpenDocument, Office Open XML,
PNG, GIF, JPG, ...). Because of the compression, even a small change
in the document may cause its representation on disk to change
completely. The difference algorithm can't "see through" this.
Furthermore, Subversion's built-in compression (like any compression
algorithm) won't be able to further compress something that's already
compressed.
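
To make the "can't see through" point concrete, here is a small Python
sketch. It does not use Subversion's own delta code. As a rough stand-in
for a binary delta it deflates the new version with the old version as a
zlib preset dictionary, and the test document, the five edits and the
name delta_size are all made up for illustration. Exact numbers will
vary with the data and the zlib build.

```python
import random
import zlib


def delta_size(old: bytes, new: bytes) -> int:
    """Size of `new` deflated with `old` as a preset dictionary.

    A crude stand-in for a binary delta, NOT Subversion's algorithm.
    Only meaningful while `old` fits inside zlib's 32 KB window.
    """
    comp = zlib.compressobj(level=9, zdict=old)
    return len(comp.compress(new) + comp.flush())


# Hypothetical document: roughly 16 KB of word-like ASCII text.
rng = random.Random(0)
vocab = ["".join(rng.choice("abcdefghijklmnopqrstuvwxyz")
                 for _ in range(rng.randint(2, 9)))
         for _ in range(500)]
old_text = " ".join(rng.choice(vocab) for _ in range(2500)).encode("ascii")

# Five small edits scattered through the document (inserted back to
# front so earlier insert positions are not shifted by later ones).
new_text = old_text
for k in range(5, 0, -1):
    pos = k * len(old_text) // 6
    new_text = (new_text[:pos]
                + b" [Edit %d: one new sentence.] " % k
                + new_text[pos:])

# The same two versions as a compressed file format would store them.
old_packed = zlib.compress(old_text)
new_packed = zlib.compress(new_text)

plain_delta = delta_size(old_text, new_text)
packed_delta = delta_size(old_packed, new_packed)

print(f"plain:  {len(new_text)} bytes, delta about {plain_delta} bytes")
print(f"packed: {len(new_packed)} bytes, delta about {packed_delta} bytes")
```

The uncompressed pair should yield a delta that is a small fraction of
the file, while the pre-compressed pair should yield a delta much closer
to the size of the whole compressed file, even though both pairs
represent exactly the same edits.
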
I've done an experiment to verify this. I set up three repositories
each containing a single document in one of three formats. In this
case, I used the text of _The Count of Monte Cristo_ from Project
Gutenberg as ASCII Text (2568 KB), as Microsoft Word DOC (6384 KB) and
as OpenOffice ODT (1060 KB). I created 8 variants of each of these
documents (inserting or removing a paragraph here or there) to
represent minor edits. I then made 80 commits to each of the three
repositories drawing upon the aforementioned 8 variants in round-robin
fashion to simulate a history of 80 minor edits made and committed.
While doing this I kept track of the total size of the repository.
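
For anyone who wants to repeat this, the procedure could be scripted
roughly as below. This is a sketch and not the script I actually used:
it assumes the svn and svnadmin command-line tools are on the PATH, that
the variant files have already been prepared by hand, and the function
names and the working-copy file name "document" are invented here.

```python
import shutil
import subprocess
from pathlib import Path


def variant_schedule(commits: int = 80, variants: int = 8) -> list[int]:
    """Round-robin order: commit i uses variant i mod `variants`."""
    return [i % variants for i in range(commits)]


def tree_size_kb(root: Path) -> int:
    """Total size of all files under `root`, in KB, rounded up."""
    total = sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
    return -(-total // 1024)


def run_experiment(work: Path, variant_files: list[Path],
                   commits: int = 80) -> list[int]:
    """Commit the variants round-robin; return repo size after each commit.

    Assumes `svn` and `svnadmin` are installed and `work` is an empty
    directory. Not called automatically in this sketch.
    """
    repo, wc = work / "repo", work / "wc"
    subprocess.run(["svnadmin", "create", str(repo)], check=True)
    subprocess.run(["svn", "checkout", repo.resolve().as_uri(), str(wc)],
                   check=True)
    doc = wc / ("document" + variant_files[0].suffix)
    sizes = []
    for n, idx in enumerate(variant_schedule(commits, len(variant_files))):
        # Overwrite the working file with the next variant, then commit.
        shutil.copyfile(variant_files[idx], doc)
        if n == 0:
            subprocess.run(["svn", "add", str(doc)], check=True)
        subprocess.run(["svn", "commit", "-m", f"edit {n + 1}", str(wc)],
                       check=True)
        sizes.append(tree_size_kb(repo))
    return sizes
```

Calling run_experiment once per format, each time with a fresh working
directory and that format's eight variant files, would give three size
series that can be compared or plotted.
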
* All three repositories grow linearly in size, but the ODT repository
grows more quickly (steeper slope).
* The ODT repository is smallest for the first few commits but quickly
outgrows the TXT and DOC repositories.
* The DOC repository is larger than the TXT repository and grows
slightly faster in comparison.
* The size difference between the TXT and DOC repositories is not as
large as the relative size of the formats (2568 KB vs 6384 KB) might
suggest. DOC is about 2.5 times as large as TXT, but much of this
difference is redundancy which SVN is quite capable of compressing away.
* Final repository sizes after 80 commits: TXT = 10052 KB; DOC = 16288
KB; ODT = 58260 KB.
See also attached PNG.
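
To put the final sizes in perspective, the arithmetic below compares
each repository with the naive alternative of storing 80 full copies of
the document. It only uses the figures quoted above. It assumes every
variant is about the same size as the original document, and the average
per commit includes the initial commit, so treat it as a rough guide.

```python
# Sizes in KB, taken from the figures quoted in this message.
doc_kb = {"TXT": 2568, "DOC": 6384, "ODT": 1060}
repo_kb = {"TXT": 10052, "DOC": 16288, "ODT": 58260}
COMMITS = 80


def avg_growth_kb(fmt: str) -> float:
    """Average repository growth per commit (initial commit included)."""
    return repo_kb[fmt] / COMMITS


def fraction_of_full_copies(fmt: str) -> float:
    """Final repository size relative to storing COMMITS full copies."""
    return repo_kb[fmt] / (COMMITS * doc_kb[fmt])


for fmt in doc_kb:
    print(f"{fmt}: {avg_growth_kb(fmt):7.2f} KB per commit on average, "
          f"{fraction_of_full_copies(fmt):.1%} of {COMMITS} full copies")
```

This works out to roughly 126 KB per commit for TXT, 204 KB for DOC and
728 KB for ODT. Put differently, the TXT and DOC repositories are about
5% and 3% of the size of 80 full copies, while the ODT repository is
about 69%.
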
// Ben Smith-Mannschott
Re: Evaluating SVN as a Document Management Solution
Posted by Thomas Harold <tg...@tgharold.com>.
That matches what we've seen as well. We store a lot of MS Access
databases (.MDB files).
Under the old system (VSS+SourceOffSite), every time we had to check-in
a copy of a 200MB MDB, all of it would be transmitted and the new
revision would add 200MB to the repository. So we were forced to pack
the MDBs into ZIP files prior to committing them, which was a PITA but
typically shaved our sizes by 90%. Committing a 200MB MDB therefore
typically required shoving 20MB across the wire and increased the
repository size by 20MB... even if we had only changed a handful of
records in the MDB.
With SVN/TSVN, we no longer bother packing the MDBs in ZIP files. Nor
do we worry about small changes and frequent commits. The first version
of a 200MB MDB will grow the repository by 66-100MB (figure either 3:1
or 2:1 compression by the SVN engine... we haven't tested). After that,
figure commit size tracks the amount of change within the MDB. If
we only changed a handful of records and didn't do something silly like
compacting the MDB afterwards, the commit might only be 1MB worth of
traffic.
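
A back-of-the-envelope comparison of the two workflows, using the rough
figures above. The assumptions are loud ones: every zipped commit adds
20MB, the first SVN commit adds 66-100MB, and every later SVN commit
adds about 1MB to the repository (we only observed the 1MB as traffic
for small changes, so the repository growth is a guess). The function
names are made up for this sketch.

```python
def zip_workflow_mb(commits: int, zipped_mb: float = 20.0) -> float:
    """Repository growth when every commit stores a full ZIP."""
    return commits * zipped_mb


def svn_workflow_mb(commits: int, first_mb: float,
                    later_mb: float = 1.0) -> float:
    """Repository growth with one large first commit, then small deltas."""
    if commits <= 0:
        return 0.0
    return first_mb + (commits - 1) * later_mb


def break_even(first_mb: float, zipped_mb: float = 20.0,
               later_mb: float = 1.0) -> int:
    """First commit count at which the SVN workflow is smaller overall."""
    if later_mb >= zipped_mb:
        raise ValueError("delta commits must be smaller than zipped commits")
    n = 1
    while svn_workflow_mb(n, first_mb, later_mb) >= zip_workflow_mb(n, zipped_mb):
        n += 1
    return n


for label, first in (("3:1", 200 / 3), ("2:1", 200 / 2)):
    print(f"{label} compression: SVN workflow is smaller from commit "
          f"{break_even(first)} onward")
```

Under these assumptions the larger first commit pays for itself by the
fourth commit at 3:1 compression and by the sixth at 2:1.
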
In summary, between the binary diff algorithm and only shoving changes
across the wire, our bandwidth consumption when dealing with MDBs has
gone down drastically. And the repository grows at a sane pace rather
than the increased pace of storing ZIP files.
Of course, if the content of the binary files is already compressed in
some fashion, you'll likely not see much gain from sending only diffs
across the wire. (Files like JPGs, TIFFs, zipped up XML files, ZIP
files, .tar.gz or .tar.bz2.)