Posted to users@subversion.apache.org by "B. Smith-Mannschott" <be...@gmail.com> on 2008/03/08 14:01:12 UTC

Re: Evaluating SVN as a Document Management Solution

On Mar 8, 2008, at 00:50, Tom Blough wrote:
>>> typically in MS Office applications, deliverables are typically
>>> drawings in AutoCAD or Microstation, and database content is typically
>>> financial data from which reports are generated.
>
> For your application, your repository will be huge.  All of the file types
> you mention are binary.  Therefore, SVN cannot calculate a diff on the file
> and will end up storing a copy of the complete file.

This is incorrect.  You're probably thinking of CVS or some similarly  
brain-damaged revision control system.  SVN uses compressed binary  
differences between versions for storage in its repository.

This works well for text of course. It also works well for binary  
formats which don't themselves use compression, such as Microsoft  
Word's DOC, uncompressed TIFF, ...

> There was a recent thread concerning using XML data formats for newer
> versions of Office in order to save diff content, but that can cause
> problems due to the fact that XML is not order specific.  Office can, and
> does, generate different XML for the same document.

Well, yes, that will tend to make your differences larger than they  
have to be.  The real problem however is that most of these "XML"  
formats are not, in fact, XML but rather XML compressed within a ZIP  
archive.

Where Subversion's binary differencing and compression fail is on file
formats that are themselves compressed (OpenDocument, OfficeOpenXML,
PNG, GIF, JPG, ...).  Because of the compression, even a small change
in the document may cause its representation on disk to change
completely.  The difference algorithm can't "see through" this.
Furthermore, Subversion's built-in compression (like any compression
algorithm) won't be able to further compress something that's already
compressed.
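
To make the point concrete, here is a minimal sketch using Python's zlib
module as a stand-in for the compression inside ODT/OOXML files.  It is not
Subversion's own delta code, and the sample text and the inserted sentence
are made up for illustration.  It compares how many trailing bytes two
versions of a document share before and after compression, and then tries
compressing the compressed data again.

```python
import random
import zlib

# Hypothetical stand-in for a document: deterministic pseudo-prose
# built from a small vocabulary (not the real Project Gutenberg text).
rng = random.Random(0)
words = ["count", "monte", "cristo", "island", "treasure", "prison", "ship", "letter"]
original = " ".join(rng.choice(words) for _ in range(50_000)).encode("ascii")

# A "minor edit": insert one short sentence near the beginning.
edited = original[:1000] + b" An inserted paragraph. " + original[1000:]


def common_suffix_len(a: bytes, b: bytes) -> int:
    """Number of trailing bytes that two byte strings share."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


packed_original = zlib.compress(original)
packed_edited = zlib.compress(edited)

# Uncompressed: everything after the insertion point is byte-identical,
# which is the kind of redundancy a binary delta can exploit.
raw_shared = common_suffix_len(original, edited)

# Compressed: the same edit leaves almost no byte-identical tail.
packed_shared = common_suffix_len(packed_original, packed_edited)

# Compressing already-compressed data a second time gains little or nothing.
repacked = zlib.compress(packed_original)

print("raw size:", len(original), "shared tail:", raw_shared)
print("compressed size:", len(packed_original), "shared tail:", packed_shared)
print("compressed twice:", len(repacked))
```

A shared tail is only a crude proxy for what a real delta algorithm can
find, but it shows the effect: the uncompressed versions are identical after
the edit, while the compressed versions are not.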

I've done an experiment to verify this.  I set up three repositories  
each containing a single document in one of three formats.  In this  
case, I used the text of _The Count of Monte Cristo_ from Project  
Gutenberg as ASCII Text (2568 KB), as Microsoft Word DOC (6384 KB) and  
as OpenOffice ODT (1060 KB).  I created 8 variants of each of these  
documents (inserting or removing a paragraph here or there) to  
represent minor edits. I then made 80 commits to each of the three  
repositories drawing upon the aforementioned 8 variants in round-robin  
fashion to simulate a history of 80 minor edits made and committed.   
While doing this I kept track of the total size of the repository.
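
For anyone who wants to repeat this, the procedure could be scripted roughly
as follows.  This is a sketch, not the script used for the numbers in this
mail: the function names, the "repo"/"wc" directory names and the
"document" file name are placeholders, and it assumes the svn and svnadmin
command-line tools are on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def round_robin(n_variants: int, n_commits: int) -> list[int]:
    """Index of the variant to commit at each step: 0, 1, ..., n-1, 0, 1, ..."""
    return [i % n_variants for i in range(n_commits)]


def dir_size_kb(path: Path) -> int:
    """Total size of all files under `path`, rounded up to whole KB."""
    total = sum(p.stat().st_size for p in path.rglob("*") if p.is_file())
    return -(-total // 1024)


def run_experiment(variants: list[Path], workdir: Path, n_commits: int = 80) -> list[int]:
    """Commit the variants in round-robin order; return the repository
    size in KB after each commit.

    Needs the `svn` and `svnadmin` command-line tools.  All directory and
    file names below are placeholders chosen for this sketch.
    """
    repo, wc = workdir / "repo", workdir / "wc"
    subprocess.run(["svnadmin", "create", str(repo)], check=True)
    subprocess.run(["svn", "checkout", "-q", repo.resolve().as_uri(), str(wc)], check=True)
    doc = wc / ("document" + variants[0].suffix)
    sizes = []
    for step, idx in enumerate(round_robin(len(variants), n_commits)):
        # Overwrite the working-copy file with the next variant.
        shutil.copyfile(variants[idx], doc)
        if step == 0:
            subprocess.run(["svn", "add", "-q", str(doc)], check=True)
        subprocess.run(["svn", "commit", "-q", "-m", f"edit {step + 1}", str(wc)], check=True)
        sizes.append(dir_size_kb(repo))
    return sizes
```

Calling run_experiment once per format, each with its own list of 8 variant
files and its own empty work directory, gives one growth curve per format.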

* All three repositories grow linearly in size, but the ODT repository  
grows more quickly (steeper slope).

* The ODT repository is smallest for the first few commits but quickly
outgrows the TXT and DOC repositories.

* The DOC repository is larger than the TXT repository and grows  
slightly faster in comparison.

* The size difference between the TXT and DOC repositories is not as
large as the relative size of the formats (2568 KB vs 6384 KB) might
suggest. DOC may be about two and a half times as large as TXT, but much
of this difference is redundancy which SVN is quite capable of
compressing away.

* Final repository sizes after 80 commits: TXT = 10052 KB; DOC = 16288  
KB; ODT = 58260 KB.
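
A quick back-of-the-envelope calculation on the figures above, as a sketch.
It divides each final repository size by the 80 commits, so the result is a
rough average that also includes the initial full copy and any fixed
repository overhead; it is not a per-edit measurement.  The function name
is made up for this example.

```python
# Figures reported in this mail, in KB.
file_kb = {"TXT": 2568, "DOC": 6384, "ODT": 1060}
final_repo_kb = {"TXT": 10052, "DOC": 16288, "ODT": 58260}
commits = 80


def growth_stats(files: dict, repos: dict, n_commits: int) -> dict:
    """Map format -> (average KB added per commit, that average as a
    fraction of one full copy of the file)."""
    stats = {}
    for fmt, size in files.items():
        per_commit = repos[fmt] / n_commits
        stats[fmt] = (per_commit, per_commit / size)
    return stats


for fmt, (per_commit, share) in growth_stats(file_kb, final_repo_kb, commits).items():
    print(f"{fmt}: {per_commit:.0f} KB per commit, {share:.1%} of the file size")
```

By this measure the TXT and DOC repositories grew by roughly 5% and 3% of a
full file per commit, while the ODT repository grew by about 69% of a full
file per commit.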

See also attached PNG.

// Ben Smith-Mannschott


Re: Evaluating SVN as a Document Management Solution

Posted by Thomas Harold <tg...@tgharold.com>.
That matches what we've seen as well.  We store a lot of MS Access 
databases (.MDB files).

Under the old system (VSS+SourceOffSite), every time we had to check-in 
a copy of a 200MB MDB, all of it would be transmitted and the new 
revision would add 200MB to the repository.  So we were forced to pack 
the MDBs into ZIP files prior to committing them, which was a PITA but
typically shaved our sizes by 90%.  So committing a 200MB MDB typically
required shoving 20MB across the wire and increased the repository size
by 20MB... even if we had only changed a handful of records in the MDB.

With SVN/TSVN, we no longer bother packing the MDBs in ZIP files.  Nor 
do we worry about small changes and frequent commits.  The first version 
of a 200MB MDB will grow the repository by 66-100MB (figure either 3:1 
or 2:1 compression by the SVN engine... we haven't tested).  After that,
figure each commit grows with the amount of change within the MDB.  If
we only changed a handful of records and didn't do something silly like 
compacting the MDB afterwards, the commit might only be 1MB worth of 
traffic.
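
To put rough numbers on that comparison, here is a small sketch.  It
assumes every later commit is a small edit costing about 1MB, and it treats
that 1MB as repository growth as well as traffic, which we have not
measured.  The function names are made up for this example.

```python
def zip_growth_mb(commits: int, per_commit_mb: float = 20.0) -> float:
    """Old approach: every commit stores a full ZIP of the MDB (~20MB)."""
    return commits * per_commit_mb


def svn_growth_mb(commits: int, first_mb: float, delta_mb: float = 1.0) -> float:
    """SVN: the first commit stores the compressed file, later ones a small delta.

    delta_mb = 1.0 is an assumption (a handful of changed records, no compacting).
    """
    if commits == 0:
        return 0.0
    return first_mb + (commits - 1) * delta_mb


def break_even_commit(first_mb: float, delta_mb: float = 1.0,
                      per_commit_mb: float = 20.0) -> int:
    """First commit count at which the SVN total is below the ZIP total."""
    if delta_mb >= per_commit_mb:
        raise ValueError("deltas must be smaller than a full ZIP to break even")
    n = 1
    while svn_growth_mb(n, first_mb, delta_mb) >= zip_growth_mb(n, per_commit_mb):
        n += 1
    return n


for label, first_mb in (("3:1", 200 / 3), ("2:1", 100.0)):
    print(label, "break-even at commit", break_even_commit(first_mb),
          "| after 80 commits:", round(svn_growth_mb(80, first_mb)), "MB vs",
          round(zip_growth_mb(80)), "MB")
```

Under those assumptions SVN starts out behind because of the larger first
commit, but overtakes the ZIP approach within the first half dozen commits.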

In summary, between the binary diff algorithm and only shoving changes 
across the wire, our bandwidth consumption when dealing with MDBs has 
gone down drastically.  And the repository grows at a sane pace rather
than the increased pace of storing ZIP files.

Of course, if the content of the binary files is already compressed in
some fashion, you'll likely not see much gain from sending only diffs
across the wire.  (Files like JPGs, compressed TIFFs, zipped up XML files,
ZIP files, .tar.gz or .tar.bz2.)
