Posted to users@subversion.apache.org by David Kaplan <Da...@ird.fr> on 2008/10/22 15:03:43 UTC

improving subversion treatment of compressed XML/text file formats

Hi,

I use subversion as my personal backup system.  Though I do my share of
coding, a lot of what I put in my subversion database are compressed XML
files (for example, openoffice documents).  Currently, svn treats these
as binary files, leading to a ballooning svn database as there is no
differencing on these files (correct me if I am wrong about this).  For
a while I have been thinking that svn could do a lot better than that
since these are trivially compressed files.  This could reduce
significantly the amount of disk space that versioning these files
requires and improve the ability to see differences between files (e.g.,
conflict resolution).  As these file formats are popping up everywhere
(openoffice, MS Office, ...), it might be worth integrating a third
"type" of file into svn (along with text and binary): compressed-text.
Someone smarter than I might even be able to do this with the current
architecture of hooks with minimal changes to subversion itself, but a
formal integration doesn't seem too hard.

The basic idea would be that when svn adds one of these files, it adds
the full compressed version initially, but thereafter it uncompresses
stored and working copy versions, differences them and just stores these
differences.  The user would specify which file formats to autodetect as
compressed text and the compression algorithm for each file type through
configuration options and svn properties.
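A minimal sketch of the uncompress-then-diff step described above, assuming the compressed file is a ZIP container whose members are UTF-8 text (as in an OpenOffice document). The names `make_zip`, `read_members` and `diff_zipped_text` are made up for illustration; nothing here is part of Subversion.

```python
import difflib
import io
import zipfile


def make_zip(members):
    """Build an in-memory ZIP archive from a {name: text} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in members.items():
            zf.writestr(name, text)
    return buf.getvalue()


def read_members(archive_bytes):
    """Decompress every member of a ZIP archive into a list of text lines."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return {
            name: zf.read(name).decode("utf-8", "replace").splitlines()
            for name in zf.namelist()
        }


def diff_zipped_text(old_bytes, new_bytes):
    """Return unified-diff lines between the decompressed members."""
    old, new = read_members(old_bytes), read_members(new_bytes)
    lines = []
    for name in sorted(set(old) | set(new)):
        lines.extend(
            difflib.unified_diff(
                old.get(name, []),
                new.get(name, []),
                fromfile="old/" + name,
                tofile="new/" + name,
                lineterm="",
            )
        )
    return lines
```

Calling `diff_zipped_text(old, new)` on two revisions of the same document yields ordinary unified-diff lines for the members that changed and nothing for the members that did not.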

One question would be what to do with conflicts, but I think this isn't
a show stopper and a logical behavior can be found.

Cheers,
David




-- 
**********************************
David M. Kaplan
Charge de Recherche 1
Institut de Recherche pour le Developpement
Centre de Recherche Halieutique Mediterraneenne et Tropicale
av. Jean Monnet
B.P. 171
34203 Sete cedex
France

Phone: +33 (0)4 99 57 32 27
Fax: +33 (0)4 99 57 32 95
http://www.ur097.ird.fr/team/dkaplan/index.html
**********************************


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@subversion.tigris.org
For additional commands, e-mail: users-help@subversion.tigris.org

Re: improving subversion treatment of compressed XML/text file formats

Posted by Henrik Sundberg <st...@gmail.com>.
On Wed, Oct 22, 2008 at 5:03 PM, David Kaplan <Da...@ird.fr> wrote:
> Hi,
>
> I use subversion as my personal backup system.  Though I do my share of
> coding, a lot of what I put in my subversion database are compressed XML
> files (for example, openoffice documents).  Currently, svn treats these
> as binary files, leading to a ballooning svn database as there is no
> differencing on these files (correct me if I am wrong about this).  For
> a while I have been thinking that svn could do a lot better than that
> since these are trivially compressed files.

I do agree with you. There is a tool http://odfsvn.sourceforge.net/
that handles this for odf files (I think).

/$


Re: improving subversion treatment of compressed XML/text file formats

Posted by Benjamin Smith-Mannschott <bs...@gmail.com>.
On Oct 24, 2008, at 22:01, Henrik Sundberg wrote:

>> Some interesting previous discussion on this topic:
>>
>> http://svn.haxx.se/users/archive-2006-02/0180.shtml
>
> Yes. I suppose that counts for archives with very many files (using
> per file compression) and just a few of them changes.
> Is that really the case for odf files? Isn't it just a few files? Is
> just the "text"-part changed for small changes?

True... though the text part (content.xml) also tends to be by far the  
largest single part:

   [bsmith@Meheadable:~/brz/k/ooexperiment]
   $ ls -sk UC-1002.odt*
   24 UC-1002.odt

   UC-1002.odt_:                  # unpacked version of UC-1002.odt
   total 192
     0 Configurations2   116 content.xml          4 mimetype
     0 META-INF          4 layout-cache           12 settings.xml
     0 Thumbnails        4 meta.xml               52 styles.xml

// Ben


Re: improving subversion treatment of compressed XML/text file formats

Posted by Henrik Sundberg <st...@gmail.com>.
On Fri, Oct 24, 2008 at 9:27 PM,  <km...@rockwellcollins.com> wrote:
>
> "Henrik Sundberg" <st...@gmail.com> wrote on 10/24/2008 01:44:03 PM:
>> On Fri, Oct 24, 2008 at 8:35 PM, Benjamin Smith-Mannschott
>> <bs...@gmail.com> wrote:
>> >> Just one question about merging with ODT-like documents: not being able
>> >> to merge would leave things no worse than binary docs, right?  By this
>> >> I
>> >> mean that SVN does not try to merge changes in binaries currently.
>> >
>> > ODT-like documents are binary documents. Period.
>>
>> I disagree. Svn handles diffs in binary files. Compressed binaries are
>> different.
>
> Some interesting previous discussion on this topic:
>
> http://svn.haxx.se/users/archive-2006-02/0180.shtml

Yes. I suppose that holds for archives containing very many files (using
per-file compression) when just a few of them change.
Is that really the case for odf files? Don't they contain just a few files? Is
just the "text" part changed for small edits?

/$


Re: improving subversion treatment of compressed XML/text file formats

Posted by Benjamin Smith-Mannschott <bs...@gmail.com>.
On Oct 24, 2008, at 20:44, Henrik Sundberg wrote:

> On Fri, Oct 24, 2008 at 8:35 PM, Benjamin Smith-Mannschott
> <bs...@gmail.com> wrote:
>>> Just one question about merging with ODT-like documents: not being  
>>> able
>>> to merge would leave things no worse than binary docs, right?  By  
>>> this I
>>> mean that SVN does not try to merge changes in binaries currently.
>>
>> ODT-like documents are binary documents. Period.
>
> I disagree. Svn handles diffs in binary files. Compressed binaries are
> different.

On the client-side it makes sense to distinguish between "text" and  
"binary". The former is composed of lines of mostly ascii characters  
of reasonable length such that the unix commands diff and merge or  
analogs can be used profitably. The latter is not amenable to such  
treatment.

This is the distinction I was making when I called ODT-like documents  
"binary".

On the server-side the distinction is a different one.  The server
doesn't care whether the file is textual in the client sense.  The binary
differences (deltas) used on the server side just treat the files they
store as sequences of bytes without structure. (The deltas used by
the server don't resemble the line-based diffs that a command like svn
diff will spit out.)

Because of this, the important distinction is between
"delta-friendly" [1] and "delta-hostile" files.

[1] I'm just making these words up. Anyone got a better suggestion?

Delta-friendly files exhibit local changes when they are edited.  That  
is, change a few words in your MS Word document, and a few opaque  
blocks of bytes in the resulting file will be changed relative to the  
previous version. Most of the file will remain as it was.

Delta-hostile files are ones where a trivial edit may make many  
changes through the file, possibly even every byte in the file.  
Because of the way most compression algorithms work, compressed files  
are a good example of delta-hostility. (In fact they are something of  
a worst-case: not only does the server have to store the full content,  
it can't even perform compression on the content because said content  
is already compressed.)
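The effect can be illustrated with a small experiment. This is only a sketch: it uses zlib with a preset dictionary as a crude stand-in for a binary delta (the old revision is the dictionary, so the output size approximates "what is new"), which is an assumption made for illustration and not the delta algorithm Subversion actually uses. The function names and the generated sample text are likewise hypothetical.

```python
import random
import zlib


def delta_size(old, new):
    """Bytes needed to encode `new` when `old` is available as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=old)
    return len(comp.compress(new) + comp.flush())


def sample_document(seed=0, lines=300):
    """Generate reproducible pseudo-text, one sentence per line."""
    rng = random.Random(seed)
    words = ["svn", "delta", "file", "server", "commit", "merge", "text",
             "binary", "archive", "revision", "diff", "office", "format"]
    return [" ".join(rng.choice(words) for _ in range(8)) for _ in range(lines)]


def compare_deltas():
    """Return (delta between plain revisions, delta between compressed revisions)."""
    old_lines = sample_document()
    # A trivial edit: one sentence inserted near the top of the document.
    new_lines = old_lines[:10] + ["one small inserted sentence"] + old_lines[10:]
    old = "\n".join(old_lines).encode()
    new = "\n".join(new_lines).encode()
    plain = delta_size(old, new)
    packed = delta_size(zlib.compress(old), zlib.compress(new))
    return plain, packed
```

`compare_deltas()` returns two sizes in bytes. The first should be a small fraction of the document, since almost everything can be copied from the old revision; the second should be close to the full compressed size, since the two compressed streams have little in common.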

Compression doesn't have to be involved, however, for a file to be
fairly delta-hostile. Consider the XMI files (a sort of XML dialect for
UML diagrams) produced by tools like Rational Modeler. These tend to
produce large deltas against the previous version because they make
heavy use of seemingly randomly generated IDs, which change every time
the file is saved. They also don't seem to write sub-elements out in a
consistent order.

Not all compressed files are necessarily maximally delta-hostile.  
Consider a JAR. These are typically composed of many individual class  
files. Because of the way the ZIP format (on which JAR is based)  
works, each of these files is compressed separately.  There's every  
reason to expect that those classes which don't change from one  
revision of the jar to another will still compress to the same stream  
of bytes.  Only those class files which have changed will be  
(completely) different.  There is still ample opportunity in such a  
scenario for finding deltas on the server-side.
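A short sketch of the per-entry behaviour described above, using Python's zipfile module. The member names and contents are invented, and the timestamps are pinned to a fixed value so that only the member data differs between the two archives. `compressed_entries` reads the raw compressed bytes of each member by skipping the 30-byte local file header plus its name and extra fields.

```python
import io
import struct
import zipfile


def build_archive(members):
    """Build a deflate-compressed ZIP from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in members.items():
            info = zipfile.ZipInfo(name, date_time=(2008, 10, 24, 0, 0, 0))
            info.compress_type = zipfile.ZIP_DEFLATED
            zf.writestr(info, data)
    return buf.getvalue()


def compressed_entries(archive):
    """Return {name: raw compressed bytes} for every member of the archive."""
    entries = {}
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        for info in zf.infolist():
            start = info.header_offset
            # Name and extra-field lengths sit at offsets 26 and 28 of the
            # local file header; the compressed data follows those fields.
            name_len, extra_len = struct.unpack("<HH", archive[start + 26:start + 30])
            data_start = start + 30 + name_len + extra_len
            entries[info.filename] = archive[data_start:data_start + info.compress_size]
    return entries
```

Building two archives that differ in a single member and comparing the results of `compressed_entries` shows identical bytes for the untouched members and different bytes only for the changed one, which is what leaves room for a server-side delta.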

Hope that's clearer
// Ben


Re: improving subversion treatment of compressed XML/text file formats

Posted by km...@rockwellcollins.com.
"Henrik Sundberg" <st...@gmail.com> wrote on 10/24/2008 01:44:03 PM:
> On Fri, Oct 24, 2008 at 8:35 PM, Benjamin Smith-Mannschott
> <bs...@gmail.com> wrote:
> >> Just one question about merging with ODT-like documents: not being able
> >> to merge would leave things no worse than binary docs, right?  By this I
> >> mean that SVN does not try to merge changes in binaries currently.
> >
> > ODT-like documents are binary documents. Period.
> 
> I disagree. Svn handles diffs in binary files. Compressed binaries are
> different.

Some interesting previous discussion on this topic:

http://svn.haxx.se/users/archive-2006-02/0180.shtml

Kevin R.

Re: improving subversion treatment of compressed XML/text file formats

Posted by Henrik Sundberg <st...@gmail.com>.
On Fri, Oct 24, 2008 at 8:35 PM, Benjamin Smith-Mannschott
<bs...@gmail.com> wrote:
>> Just one question about merging with ODT-like documents: not being able
>> to merge would leave things no worse than binary docs, right?  By this I
>> mean that SVN does not try to merge changes in binaries currently.
>
> ODT-like documents are binary documents. Period.

I disagree. Svn handles diffs in binary files. Compressed binaries are
different.

/$


Re: improving subversion treatment of compressed XML/text file formats

Posted by Benjamin Smith-Mannschott <bs...@gmail.com>.
On Oct 24, 2008, at 19:04, David Kaplan wrote:

 > On Fri, 2008-10-24 at 18:28 +0200, Benjamin Smith-Mannschott wrote:
 > > SVN Repo Space Efficiency when Edited often:

 > > format         space efficiency  merge-friendliness
 > > =============  ================  ==================
 > > plain text     very good         very good
 > > html           very good         good
 > > flat ODT       good              poor [1]
 > > msword doc     acceptable        impossible [2]
 > > msword docx    poor              impossible [2]
 > > ODT            poor              impossible [2]
 > > ---------------------------------------------------
 > > [1] This format isn't widely supported (a pity, really).
 > > [2] SVN will not and should not attempt to merge these
 > > formats as they are not textual. Microsoft-word and
 > > OpenOffice do contain features allowing a user to
 > > perform merges independently of svn within the tool,
 > > it's just that they'd have to do this "by hand" for
 > > every merge conflict.
 > > ===================================================

 > Just one question about merging with ODT-like documents: not being able
 > to merge would leave things no worse than binary docs, right?  By this I
 > mean that SVN does not try to merge changes in binaries currently.

ODT-like documents are binary documents. Period.

 > So
 > one could still get advantages in space by doing some sort of
 > uncompress-diff process as I previously suggested, but one would just be
 > left without the possibility of merging.  Is this correct?

Correct. The entries (files) in a ZIP can be compressed or not. If
there was a way to convince OpenOffice to just store its XML parts to
this zip without compressing them this would be enough to allow
subversion's server-side binary diffing algorithm to take
advantage of similarities between file revisions and thus save space.
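One way to approximate this outside the office suite is to repack the archive after saving, as sketched below with Python's zipfile module. The function name `repack_stored` is hypothetical. The result is still a valid ZIP with the members in their original order, but the sketch does not address whether the application recompresses the file the next time it saves.

```python
import io
import zipfile


def repack_stored(archive_bytes):
    """Rewrite a ZIP archive so that every member is stored uncompressed."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_STORED) as dst:
        for info in src.infolist():  # keep the original member order
            stored = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            stored.compress_type = zipfile.ZIP_STORED
            stored.external_attr = info.external_attr
            dst.writestr(stored, src.read(info.filename))
    return out.getvalue()
```

After repacking, the XML parts appear verbatim inside the archive bytes, so a small edit to the document changes only a small region of the file.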

On Oct 24, 2008, at 19:04, David Kaplan wrote:
 > Also, why is html better at merging than flat ODT?  I would imagine that
 > any XML-like format would have problems with blind merging.

I'm continually surprised how little research has apparently been done
on the relative mergability of various formats.  The traditional
3-way-merge operates over simple sequences of values (lines of
text), identifying common regions and differences.  This works okay
for simple formats but breaks down with formats which have complicated
invariants.
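For readers unfamiliar with the technique, here is a simplified sketch of a line-based three-way merge built on Python's difflib. It is an illustration written for this thread, not the implementation used by Subversion or by the unix merge tools, and the names `merge3` and `_line_map` are made up. Lines of the common ancestor that survive in both edited versions act as anchors; each region between anchors is taken from whichever side changed it, and is flagged as a conflict if both sides changed it differently.

```python
from difflib import SequenceMatcher


def _line_map(base, other):
    """Map each base line index to its matching index in `other`, if any."""
    mapping = {}
    matcher = SequenceMatcher(None, base, other, autojunk=False)
    for i, j, n in matcher.get_matching_blocks():
        for k in range(n):
            mapping[i + k] = j + k
    return mapping


def merge3(base, mine, theirs):
    """Merge two edited line lists against their common ancestor.

    Returns (merged_lines, number_of_conflicts).
    """
    map_m = _line_map(base, mine)
    map_t = _line_map(base, theirs)
    merged, conflicts = [], 0
    b = m = t = 0

    def resolve(base_chunk, mine_chunk, theirs_chunk):
        nonlocal conflicts
        if mine_chunk == base_chunk:
            return theirs_chunk  # only "theirs" changed (or nobody did)
        if theirs_chunk == base_chunk or mine_chunk == theirs_chunk:
            return mine_chunk    # only "mine" changed, or both agree
        conflicts += 1
        return ["<<<<<<< mine", *mine_chunk, "=======",
                *theirs_chunk, ">>>>>>> theirs"]

    for i in range(len(base)):
        if i in map_m and i in map_t:  # base line kept by both sides
            merged += resolve(base[b:i], mine[m:map_m[i]], theirs[t:map_t[i]])
            merged.append(base[i])
            b, m, t = i + 1, map_m[i] + 1, map_t[i] + 1
    merged += resolve(base[b:], mine[m:], theirs[t:])
    return merged, conflicts
```

Note that the merge only ever looks at whole lines. Nothing in it knows about matching tags, unique IDs or element order, which is why a result with zero conflicts can still violate the invariants of a complicated format.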

HTML, particularly when hand-edited, is often simple enough in structure to
allow merging. It's also simple enough for a human to fix at the source
level should there be a conflict. Flat ODT? Not so much.

For a more detailed take, let me include an excerpt from a feature
proposal I wrote recently for the svn pre-commit hook script at my
place of work:

B. Smith-Mannschott wrote, in "Subversion Hookscript Requirements Proposal":

| * Is XML textual or not?
|
| XML presents yet another complication. Strictly speaking, its
| mime-type should be application/xml. A mime-type of text/xml would have
| to specify its encoding with a charset on svn:mime-type, which would be
| redundant for XML, since it carries its encoding information in-band.
|
| However, using application/xml for all XML files would be problematic
| because it would prevent subversion from attempting to merge
| changes. Even more annoyingly, it would make it unnecessarily
| cumbersome to view changes (svn diff).
|
| One size does not fit all. Some XML files will be edited by hand and
| are structured simply enough that line-based merging has a good chance
| of success and line-based diff produces a useful comparison. Examples
| include XML schemas, XML transformations, maven project object models,
| XHTML pages and many more.
|
| Other XML files are huge dumps of hideously complicated data
| structures with all kinds of complex and undocumented
| constraints. These should be treated as if they were binary files and
| given an application/xml mime type. These kinds of files are rarely
| seen in source form and virtually impossible to diff and merge
| successfully. XMI (Rational Modeler UML) files fall in this category,
| as do the "flat" variants of the OpenOffice.org file formats.
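The kind of policy the excerpt argues for could be expressed as a small lookup that a hook script consults. The sketch below is hypothetical throughout: the function name, the extension lists, and the choice of text/xml for the mergeable group are assumptions for illustration and are not taken from the proposal. It relies only on the fact that Subversion treats a file as non-text when its svn:mime-type does not start with "text/".

```python
import os

# Hypothetical policy tables; the entries are examples, not a complete list.
MERGEABLE_XML_EXTENSIONS = {".xsd", ".xsl", ".xslt", ".xhtml"}
MERGEABLE_XML_NAMES = {"pom.xml"}
OPAQUE_XML_EXTENSIONS = {".xmi", ".fodt", ".fods", ".fodp"}


def suggested_mime_type(path):
    """Suggest an svn:mime-type for an XML file, or None for no opinion."""
    name = os.path.basename(path).lower()
    extension = os.path.splitext(name)[1]
    if extension in OPAQUE_XML_EXTENSIONS:
        # Machine-generated dumps: treat as binary, no line-based diff or merge.
        return "application/xml"
    if extension in MERGEABLE_XML_EXTENSIONS or name in MERGEABLE_XML_NAMES:
        # Hand-edited, simply structured XML: keep diff and merge available.
        return "text/xml"
    return None
```

For example, `suggested_mime_type("model/orders.xmi")` falls in the opaque group, while `suggested_mime_type("pom.xml")` stays mergeable.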

// Ben


Re: improving subversion treatment of compressed XML/text file formats

Posted by David Kaplan <Da...@ird.fr>.
Hi,

On Fri, 2008-10-24 at 18:28 +0200, Benjamin Smith-Mannschott wrote:
> SVN Repo Space Efficiency when Edited often:
> 
> format         space efficiency  merge-friendliness
> =============  ================  ==================
> plain text     very good         very good
> html           very good         good
> flat ODT       good              poor [1]
> msword doc     acceptable        impossible [2]
> msword docx    poor              impossible [2]
> ODT            poor              impossible [2]
> ---------------------------------------------------
> [1] This format isn't widely supported (a pity, really).
> [2] SVN will not and should not attempt to merge these
> formats as they are not textual. Microsoft-word and
> OpenOffice do contain features allowing a user to
> perform merges independently of svn within the tool,
> it's just that they'd have to do this "by hand" for
> every merge conflict.
> ===================================================
> 
> // Ben
> 

Just one question about merging with ODT-like documents: not being able
to merge would leave things no worse than binary docs, right?  By this I
mean that SVN does not try to merge changes in binaries currently.  So
one could still get advantages in space by doing some sort of
uncompress-diff process as I previously suggested, but one would just be
left without the possibility of merging.  Is this correct?

Also, why is html better at merging than flat ODT?  I would imagine that
any XML-like format would have problems with blind merging.

If anyone has thoughts on implementing this super-compressed text
diffing with appropriate hooks somehow, I would love to hear them.  It
doesn't sound that difficult to do the uncompress, tar, diff process,
but doing it right is probably out of my league.  

I also wonder if bazaar or some other cms has tried to tackle this
problem?

Cheers,
David




Re: improving subversion treatment of compressed XML/text file formats

Posted by Benjamin Smith-Mannschott <bs...@gmail.com>.
On Oct 24, 2008, at 16:21, David Kaplan wrote:

> Hi,
>
> On Wed, 2008-10-22 at 17:45 -0500, Ryan Schmidt wrote:
>
>> Subversion stores all files in the repository as differences against
>> previous versions. It does not differentiate between text or binary
>> files at this point. However, depending on the compression algorithm,
>> compressed files don't necessarily lend themselves to efficient
>> diffing, which can result in them taking more space in the repository
>> over time than the uncompressed versions would have.
>>
>
> I wasn't sure about this point, but my experience is that small  
> changes
> to a document seem to produce large diffs in the compressed version
> leading to a large repository.

I can confirm this.  We'll be using an svn repository for
documentation in the future. One question that arose was which
tool/format to use for our textual documents. (I pushed for plain text
(markdown or reStructuredText) because it diffs and merges nicely, but
the usability story just isn't there for most of those who'll actually
be writing the documentation.)

I ran some tests simulating a few thousand edits and commits using a  
few different formats. Traditional doc files are pretty well behaved  
WRT repository space usage. ODF files stink because every edit, no  
matter how minor, ends up storing the whole document in the repository  
again.  I discovered FODT (flat ODT), which merges all the parts of a  
normal ODT file into a single XML (images and other binary things are  
base-64 encoded). This sounds ludicrous, but it's quite svn-friendly.  
Unfortunately, the flat variants of the openoffice.org file formats  
only seem to be supported by the OO.o 2.4 included with Ubuntu. I've  
not found much mention of it online and I've not found it supported  
under Windows or MacOS. We finally settled on using OO.o's HTML  
support as our "standard" format for textual documentation, knowing  
that we could "upgrade" to ODT should we require its additional  
features.

SVN Repo Space Efficiency when Edited often:

format         space efficiency  merge-friendliness
=============  ================  ==================
plain text     very good         very good
html           very good         good
flat ODT       good              poor [1]
msword doc     acceptable        impossible [2]
msword docx    poor              impossible [2]
ODT            poor              impossible [2]
---------------------------------------------------
[1] This format isn't widely supported (a pity, really).
[2] SVN will not and should not attempt to merge these
formats as they are not textual. Microsoft-word and
OpenOffice do contain features allowing a user to
perform merges independently of svn within the tool,
it's just that they'd have to do this "by hand" for
every merge conflict.
===================================================

// Ben




Re: improving subversion treatment of compressed XML/text file formats

Posted by David Kaplan <Da...@ird.fr>.
Hi,

On Wed, 2008-10-22 at 17:45 -0500, Ryan Schmidt wrote:

> Subversion stores all files in the repository as differences against  
> previous versions. It does not differentiate between text or binary  
> files at this point. However, depending on the compression algorithm,  
> compressed files don't necessarily lend themselves to efficient  
> diffing, which can result in them taking more space in the repository  
> over time than the uncompressed versions would have.
> 

I wasn't sure about this point, but my experience is that small changes
to a document seem to produce large diffs in the compressed version
leading to a large repository.


> Note that an OpenOffice.org file is not a compressed text file, but a  
> compressed directory of several text files.
> 

Yes, but this shouldn't be too difficult to handle as at least the
standard linux diff command diffs directories without difficulty (say
that three times fast).  I believe that subversion uses its own
algorithm, but handling directories can't be too hard.  One option would
be to tar the directory into one file that would still be human
readable.
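The tar idea can be sketched in a few lines of Python using the standard zipfile and tarfile modules. This is only an illustration under stated assumptions: the function name `zip_to_tar` is made up, members are sorted by name and given a fixed timestamp so that two revisions with the same contents produce the same bytes, and directory entries are skipped.

```python
import io
import tarfile
import zipfile


def zip_to_tar(archive_bytes):
    """Unpack a ZIP archive and re-wrap its members in an uncompressed tar."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as src, \
            tarfile.open(fileobj=out, mode="w") as dst:
        for name in sorted(src.namelist()):  # stable order between revisions
            if name.endswith("/"):
                continue  # skip directory entries; file paths imply them
            data = src.read(name)
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = 0  # fixed timestamp so unchanged members stay identical
            dst.addfile(info, io.BytesIO(data))
    return out.getvalue()
```

Because tar stores member data uncompressed, the XML is visible in the output as-is, and the resulting file can be fed to any byte-level or line-level diff.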

Cheers,
David




Re: improving subversion treatment of compressed XML/text file formats

Posted by Ryan Schmidt <su...@ryandesign.com>.
On Oct 22, 2008, at 10:03, David Kaplan wrote:

> I use subversion as my personal backup system.  Though I do my  
> share of
> coding, a lot of what I put in my subversion database are  
> compressed XML
> files (for example, openoffice documents).  Currently, svn treats  
> these
> as binary files, leading to a ballooning svn database as there is no
> differencing on these files (correct me if I am wrong about this).

Subversion stores all files in the repository as differences against  
previous versions. It does not differentiate between text or binary  
files at this point. However, depending on the compression algorithm,  
compressed files don't necessarily lend themselves to efficient  
diffing, which can result in them taking more space in the repository  
over time than the uncompressed versions would have.

> For a while I have been thinking that svn could do a lot better than
> that since these are trivially compressed files.  This could reduce
> significantly the amount of disk space that versioning these files
> requires and improve the ability to see differences between files  
> (e.g.,
> conflict resolution).  As these file formats are popping up everywhere
> (openoffice, MS Office, ...), it might be worth integrating a third
> "type" of file into svn (along with text and binary): compressed-text.
> Someone smarter than I might even be able to do this with the current
> architecture of hooks with minimal changes to subversion itself, but a
> formal integration doesn't seem too hard.

Note that an OpenOffice.org file is not a compressed text file, but a  
compressed directory of several text files.

> The basic idea would be that when svn adds one of these files, it adds
> the full compressed version initially, but thereafter it uncompresses
> stored and working copy versions, differences them and just stores  
> these
> differences.  The user would specify which file formats to  
> autodetect as
> compressed text and the compression algorithm for each file type  
> through
> configuration options and svn properties.
>
> One question would be what to do with conflicts, but I think this  
> isn't
> a show stopper and a logical behavior can be found.


