Posted to users@subversion.apache.org by David Kaplan <Da...@ird.fr> on 2008/10/24 14:21:54 UTC
Re: improving subversion treatment of compressed XML/text file formats
Hi,
On Wed, 2008-10-22 at 17:45 -0500, Ryan Schmidt wrote:
> Subversion stores all files in the repository as differences against
> previous versions. It does not differentiate between text or binary
> files at this point. However, depending on the compression algorithm,
> compressed files don't necessarily lend themselves to efficient
> diffing, which can result in them taking more space in the repository
> over time than the uncompressed versions would have.
>
I wasn't sure about this point, but my experience is that small changes
to a document seem to produce large diffs in the compressed version
leading to a large repository.
> Note that an OpenOffice.org file is not a compressed text file, but a
> compressed directory of several text files.
>
Yes, but this shouldn't be too difficult to handle, as at least the
standard Linux diff command diffs directories without difficulty (say
that three times fast). I believe that Subversion uses its own
algorithm, but handling directories can't be too hard. One option would
be to tar the unpacked directory into one file, which would still be
human readable.
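A minimal sketch of the "tar the directory" idea, assuming Python's standard zipfile and tarfile modules. The function name zip_to_tar and the sample document are made up for illustration, and nothing here is something Subversion does on its own. It copies each entry of a ZIP-based document into an uncompressed tar, so the XML parts end up readable in the result:

```python
import io
import tarfile
import zipfile


def zip_to_tar(archive: bytes) -> bytes:
    """Copy every file entry of a ZIP archive into an uncompressed tar."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(archive)) as src, \
            tarfile.open(fileobj=out, mode="w") as dst:
        for info in src.infolist():
            if info.is_dir():
                continue  # directories are implied by the member paths
            data = src.read(info)
            member = tarfile.TarInfo(info.filename)
            member.size = len(data)
            dst.addfile(member, io.BytesIO(data))
    return out.getvalue()


# Hypothetical stand-in for an .odt file: a compressed ZIP with one XML part.
content = b"<text:p>Hello, world</text:p>\n" * 200
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("content.xml", content)

tar_bytes = zip_to_tar(sample.getvalue())
print(content in sample.getvalue())  # is the XML readable inside the ZIP?
print(content in tar_bytes)          # is the XML readable inside the tar?
```

The tar grows larger than the ZIP, but a small edit to content.xml then shows up as a small, local change in the tar.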
Cheers,
David
--
**********************************
David M. Kaplan
Charge de Recherche 1
Institut de Recherche pour le Developpement
Centre de Recherche Halieutique Mediterraneenne et Tropicale
av. Jean Monnet
B.P. 171
34203 Sete cedex
France
Phone: +33 (0)4 99 57 32 27
Fax: +33 (0)4 99 57 32 95
http://www.ur097.ird.fr/team/dkaplan/index.html
**********************************
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@subversion.tigris.org
For additional commands, e-mail: users-help@subversion.tigris.org
Re: improving subversion treatment of compressed XML/text file formats
Posted by Benjamin Smith-Mannschott <bs...@gmail.com>.
On Oct 24, 2008, at 22:01, Henrik Sundberg wrote:
>> Some interesting previous discussion on this topic:
>>
>> http://svn.haxx.se/users/archive-2006-02/0180.shtml
>
> Yes. I suppose that holds for archives with very many files (using
> per-file compression) where just a few of them change.
> Is that really the case for ODF files? Don't they contain just a few
> files? And is only the "text" part changed by a small edit?
True... though the text part (content.xml) also tends to be by far the
largest single part:
[bsmith@Meheadable:~/brz/k/ooexperiment]
$ ls -sk UC-1002.odt*
 24 UC-1002.odt

UC-1002.odt_:    # unpacked version of UC-1002.odt
total 192
  0 Configurations2    116 content.xml      4 mimetype
  0 META-INF             4 layout-cache    12 settings.xml
  0 Thumbnails           4 meta.xml        52 styles.xml
// Ben
Re: improving subversion treatment of compressed XML/text file formats
Posted by Henrik Sundberg <st...@gmail.com>.
On Fri, Oct 24, 2008 at 9:27 PM, <km...@rockwellcollins.com> wrote:
>
> "Henrik Sundberg" <st...@gmail.com> wrote on 10/24/2008 01:44:03 PM:
>> On Fri, Oct 24, 2008 at 8:35 PM, Benjamin Smith-Mannschott
>> <bs...@gmail.com> wrote:
>> >> Just one question about merging with ODT-like documents: not being able
>> >> to merge would leave things no worse than binary docs, right? By this I
>> >> mean that SVN does not try to merge changes in binaries currently.
>> >
>> > ODT-like documents are binary documents. Period.
>>
>> I disagree. Svn handles diffs in binary files. Compressed binaries are
>> different.
>
> Some interesting previous discussion on this topic:
>
> http://svn.haxx.se/users/archive-2006-02/0180.shtml
Yes. I suppose that holds for archives with very many files (using
per-file compression) where just a few of them change.
Is that really the case for ODF files? Don't they contain just a few
files? And is only the "text" part changed by a small edit?
/$
Re: improving subversion treatment of compressed XML/text file formats
Posted by km...@rockwellcollins.com.
"Henrik Sundberg" <st...@gmail.com> wrote on 10/24/2008 01:44:03 PM:
> On Fri, Oct 24, 2008 at 8:35 PM, Benjamin Smith-Mannschott
> <bs...@gmail.com> wrote:
> >> Just one question about merging with ODT-like documents: not being able
> >> to merge would leave things no worse than binary docs, right? By this I
> >> mean that SVN does not try to merge changes in binaries currently.
> >
> > ODT-like documents are binary documents. Period.
>
> I disagree. Svn handles diffs in binary files. Compressed binaries are
> different.
Some interesting previous discussion on this topic:
http://svn.haxx.se/users/archive-2006-02/0180.shtml
Kevin R.
Re: improving subversion treatment of compressed XML/text file formats
Posted by Benjamin Smith-Mannschott <bs...@gmail.com>.
On Oct 24, 2008, at 20:44, Henrik Sundberg wrote:
> On Fri, Oct 24, 2008 at 8:35 PM, Benjamin Smith-Mannschott
> <bs...@gmail.com> wrote:
>>> Just one question about merging with ODT-like documents: not being able
>>> to merge would leave things no worse than binary docs, right? By this I
>>> mean that SVN does not try to merge changes in binaries currently.
>>
>> ODT-like documents are binary documents. Period.
>
> I disagree. Svn handles diffs in binary files. Compressed binaries are
> different.
On the client side it makes sense to distinguish between "text" and
"binary". The former is composed of lines of mostly ASCII characters
of reasonable length, such that the Unix commands diff and merge (or
analogs) can be used profitably. The latter is not amenable to such
treatment.
This is the distinction I was making when I called ODT-like documents
"binary".
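To illustrate the client-side notion of "text", here is a small sketch that uses Python's difflib as a stand-in for a line-based diff. The helper name rewritten_fraction and the sample markup are invented for the example. The same one-word edit touches a single line when the markup is broken into lines of reasonable length, but the entire document when everything sits on one line:

```python
import difflib


def rewritten_fraction(old: str, new: str) -> float:
    """Fraction of old's characters on lines a line-based diff marks as changed."""
    a, b = old.splitlines(), new.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    touched = sum(len(line)
                  for tag, i1, i2, _, _ in matcher.get_opcodes() if tag != "equal"
                  for line in a[i1:i2])
    return touched / sum(len(line) for line in a)


words = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]

# One element per line: what the client would call "text".
pretty_old = "\n".join(f"<p>{w}</p>" for w in words)
pretty_new = pretty_old.replace("gamma", "GAMMA")

# The same markup with no line breaks at all.
flat_old = pretty_old.replace("\n", "")
flat_new = pretty_new.replace("\n", "")

print(rewritten_fraction(pretty_old, pretty_new))  # one line out of six
print(rewritten_fraction(flat_old, flat_new))      # the only line, i.e. everything
```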
On the server side the distinction is a different one. The server
doesn't care whether the file is textual in the client sense. The binary
differences (deltas) used on the server side just treat the files they
store as sequences of bytes without structure. (The deltas used by
the server don't resemble the line-based diffs that a command like svn
diff will spit out.)
Because of this, the important distinction is between
"delta-friendly" [1] and "delta-hostile" files.
[1] I'm just making these words up. Anyone got a better suggestion?
Delta-friendly files exhibit local changes when they are edited. That
is, change a few words in your MS Word document, and a few opaque
blocks of bytes in the resulting file will be changed relative to the
previous version. Most of the file will remain as it was.
Delta-hostile files are ones where a trivial edit may make many
changes through the file, possibly even every byte in the file.
Because of the way most compression algorithms work, compressed files
are a good example of delta-hostility. (In fact they are something of
a worst-case: not only does the server have to store the full content,
it can't even perform compression on the content because said content
is already compressed.)
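A rough way to see this effect, sketched with Python's zlib module. The helper shared_fraction and the generated sample text are assumptions made for the example, and the measure (common prefix plus common suffix) is only a crude proxy for what a real delta algorithm can reuse. One word is changed in the middle of a document, and the two versions are compared before and after compression:

```python
import zlib


def shared_fraction(a: bytes, b: bytes) -> float:
    """Share of the shorter input covered by the common prefix plus common suffix."""
    limit = min(len(a), len(b))
    prefix = 0
    while prefix < limit and a[prefix] == b[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and a[-1 - suffix] == b[-1 - suffix]:
        suffix += 1
    return (prefix + suffix) / limit


lines = [f"Paragraph {i}: revision notes for section {i * 7 % 13}.\n"
         for i in range(400)]
before = "".join(lines).encode()
lines[200] = lines[200].replace("notes", "remarks")  # one small edit mid-document
after = "".join(lines).encode()

packed_before = zlib.compress(before, 9)
packed_after = zlib.compress(after, 9)

print("uncompressed:", shared_fraction(before, after))
print("compressed:  ", shared_fraction(packed_before, packed_after))
```

The uncompressed versions are identical except around the edit. The compressed streams usually share far less, although the exact figure depends on the input and the zlib build.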
Compression doesn't have to be involved, however, for a file to be
fairly delta-hostile. Consider the XMI (a sort of XML dialect for UML
diagrams) produced by tools like Rational Modeler. These files tend to
produce large deltas against the previous version because they make
heavy use of seemingly randomly generated IDs, which change every time
the file is saved. The tools also don't seem to write sub-elements out
in a consistent order.
Not all compressed files are necessarily maximally delta-hostile.
Consider a JAR. These are typically composed of many individual class
files. Because of the way the ZIP format (on which JAR is based)
works, each of these files is compressed separately. There's every
reason to expect that those classes which don't change from one
revision of the jar to another will still compress to the same stream
of bytes. Only those class files which have changed will be
(completely) different. There is still ample opportunity in such a
scenario for finding deltas on the server-side.
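The per-entry behaviour can be checked with a short sketch using Python's zipfile module. The entry names and the fake "bytecode" are placeholders, and the helpers build_archive and compressed_streams are invented for this example. Two archives are built that differ in one entry, and the raw compressed bytes of each entry are then compared:

```python
import io
import struct
import zipfile
import zlib


def build_archive(entries: dict) -> bytes:
    """Return the bytes of a ZIP archive holding the given name -> data entries."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in entries.items():
            # Fixed timestamp so entry metadata does not differ between builds.
            info = zipfile.ZipInfo(name, date_time=(2008, 10, 24, 0, 0, 0))
            info.compress_type = zipfile.ZIP_DEFLATED
            zf.writestr(info, data)
    return buf.getvalue()


def compressed_streams(archive: bytes) -> dict:
    """Map each entry name to its raw compressed bytes inside the archive."""
    streams = {}
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        for info in zf.infolist():
            # Local file header: 30 fixed bytes, then file name and extra field.
            name_len, extra_len = struct.unpack_from(
                "<HH", archive, info.header_offset + 26)
            start = info.header_offset + 30 + name_len + extra_len
            streams[info.filename] = archive[start:start + info.compress_size]
    return streams


classes_v1 = {
    "app/Foo.class": b"foo bytecode " * 300,
    "app/Bar.class": b"bar bytecode " * 300,
    "app/Baz.class": b"baz bytecode " * 300,
}
classes_v2 = {**classes_v1, "app/Bar.class": b"bar bytecode, patched " * 300}

v1 = compressed_streams(build_archive(classes_v1))
v2 = compressed_streams(build_archive(classes_v2))

# Sanity check: each raw stream inflates back to the original entry.
assert all(zlib.decompress(v1[n], -15) == classes_v1[n] for n in classes_v1)

unchanged = [name for name in v1 if v1[name] == v2[name]]
print(unchanged)
```

Only the edited entry gets a new compressed stream. The streams of the other entries are byte-for-byte identical, which is what leaves a byte-level delta something to work with.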
Hope that's clearer
// Ben
Re: improving subversion treatment of compressed XML/text file formats
Posted by Henrik Sundberg <st...@gmail.com>.
On Fri, Oct 24, 2008 at 8:35 PM, Benjamin Smith-Mannschott
<bs...@gmail.com> wrote:
>> Just one question about merging with ODT-like documents: not being able
>> to merge would leave things no worse than binary docs, right? By this I
>> mean that SVN does not try to merge changes in binaries currently.
>
> ODT-like documents are binary documents. Period.
I disagree. Svn handles diffs in binary files. Compressed binaries are
different.
/$
Re: improving subversion treatment of compressed XML/text file formats
Posted by Benjamin Smith-Mannschott <bs...@gmail.com>.
On Oct 24, 2008, at 19:04, David Kaplan wrote:
> On Fri, 2008-10-24 at 18:28 +0200, Benjamin Smith-Mannschott wrote:
> > SVN Repo Space Efficiency when Edited often:
> >
> > format         space efficiency  merge-friendliness
> > =============  ================  ==================
> > plain text     very good         very good
> > html           very good         good
> > flat ODT       good              poor [1]
> > msword doc     acceptable        impossible [2]
> > msword docx    poor              impossible [2]
> > ODT            poor              impossible [2]
> > ---------------------------------------------------
> > [1] This format isn't widely supported (a pity, really).
> > [2] SVN will not and should not attempt to merge these
> >     formats as they are not textual. Microsoft Word and
> >     OpenOffice do contain features allowing a user to
> >     perform merges independently of svn within the tool,
> >     it's just that they'd have to do this "by hand" for
> >     every merge conflict.
> > ===================================================
> Just one question about merging with ODT-like documents: not being able
> to merge would leave things no worse than binary docs, right? By this I
> mean that SVN does not try to merge changes in binaries currently.
ODT-like documents are binary documents. Period.
> So
> one could still get advantages in space by doing some sort of
> uncompress-diff process as I previously suggested, but one would just be
> left without the possibility of merging. Is this correct?
Correct. The entries (files) in a ZIP can be compressed or not. If
there were a way to convince OpenOffice to just store its XML parts in
the ZIP without compressing them, this would be enough to allow
Subversion's server-side binary diffing algorithm to take
advantage of similarities between file revisions and thus save space.
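Lacking such an option in the application, the repacking could in principle be done outside it. The following is only a sketch using Python's zipfile module: repack_stored is a made-up helper, the sample archive is a stand-in for a real document, and the thread does not establish whether OpenOffice would open such a file or how this would be hooked into a commit workflow. Saving from the application would in any case compress the parts again.

```python
import io
import zipfile


def repack_stored(archive: bytes) -> bytes:
    """Rewrite a ZIP archive so that every entry is stored without compression."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(archive)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_STORED) as dst:
        for info in src.infolist():  # keeps the original entry order
            entry = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            entry.compress_type = zipfile.ZIP_STORED
            entry.external_attr = info.external_attr
            dst.writestr(entry, src.read(info))
    return out.getvalue()


# Hypothetical stand-in for an .odt file with compressed parts.
content = b"<text:p>Some paragraph text</text:p>\n" * 300
packed = io.BytesIO()
with zipfile.ZipFile(packed, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("mimetype", b"application/vnd.oasis.opendocument.text")
    zf.writestr("content.xml", content)

stored = repack_stored(packed.getvalue())
with zipfile.ZipFile(io.BytesIO(stored)) as zf:
    print(all(i.compress_type == zipfile.ZIP_STORED for i in zf.infolist()))
    print(zf.read("content.xml") == content)  # contents are unchanged
print(content in stored)  # the XML now appears verbatim in the archive bytes
```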
On Oct 24, 2008, at 19:04, David Kaplan wrote:
> Also, why is html better at merging than flat ODT? I would imagine that
> any XML-like format would have problems with blind merging.
I'm continually surprised how little research has apparently been done
on the relative mergeability of various formats. The traditional
3-way merge operates over a simple sequence of values (lines of
text), identifying common regions and differences. This works okay
for simple formats but breaks down with formats which have complicated
invariants.
HTML, particularly when hand-edited, is often simple enough in structure
to allow merging. It's also simple enough for a human to fix at the
source level should there be a conflict. Flat ODT? Not so much.
For a more detailed take, let me include an excerpt from a feature
proposal I wrote recently for the svn pre-commit hook script at my
place of work:
B. Smith-Mannschott Wrote, in "Subversion Hookscript Requirements
Proposal":
| * Is XML textual or not?
|
| XML presents yet another complication. Most properly, its mime type
| should be application/xml. A mime-type of text/xml would have to specify
| its encoding with a charset on svn:mime-type, and this would be redundant
| for XML, since it carries its encoding information in-band.
|
| However, using application/xml for all XML files would be problematic
| because it would prevent Subversion from attempting to merge
| changes. Even more annoyingly, it would make it unnecessarily
| cumbersome to view changes (svn diff).
|
| One size does not fit all. Some XML files will be edited by hand and
| are structured simply enough that line-based merging has a good chance
| of success and line-based diff produces a useful comparison. Examples
| include XML schemas, XML transformations, maven project object models,
| XHTML pages and many more.
|
| Other XML files are huge dumps of hideously complicated data
| structures with all kinds of complex and undocumented
| constraints. These should be treated as if they were binary files and
| given an application/xml mime type. These kinds of files are rarely
| seen in source form and virtually impossible to diff and merge
| successfully. XMI (Rational Modeler UML) files fall in this category,
| as do the "flat" variants of the OpenOffice.org file formats.
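The rule the excerpt relies on can be approximated as below. This is a sketch of Subversion's text/binary decision as I understand it, not its actual code: the function name treated_as_text is invented, and the two image types listed as exceptions are recalled from memory and should be checked against the Subversion documentation.

```python
from typing import Optional

# Non-"text/" types that, to my knowledge, Subversion still treats as text.
TEXTUAL_EXCEPTIONS = {"image/x-xbitmap", "image/x-xpixmap"}


def treated_as_text(mime_type: Optional[str]) -> bool:
    """Guess whether a file with this svn:mime-type gets line-based diff and merge."""
    if mime_type is None:
        return True  # no svn:mime-type property set: handled as text
    base = mime_type.split(";", 1)[0].strip()  # drop parameters such as charset
    return base.startswith("text/") or base in TEXTUAL_EXCEPTIONS


for value in (None, "text/xml; charset=UTF-8", "application/xml"):
    print(value, treated_as_text(value))
```

Under this rule a hand-edited schema tagged text/xml keeps line-based diff and merge, while an XMI dump tagged application/xml is handled like any other binary file.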
// Ben
Re: improving subversion treatment of compressed XML/text file formats
Posted by David Kaplan <Da...@ird.fr>.
Hi,
On Fri, 2008-10-24 at 18:28 +0200, Benjamin Smith-Mannschott wrote:
> SVN Repo Space Efficiency when Edited often:
>
> format         space efficiency  merge-friendliness
> =============  ================  ==================
> plain text     very good         very good
> html           very good         good
> flat ODT       good              poor [1]
> msword doc     acceptable        impossible [2]
> msword docx    poor              impossible [2]
> ODT            poor              impossible [2]
> ---------------------------------------------------
> [1] This format isn't widely supported (a pity, really).
> [2] SVN will not and should not attempt to merge these
>     formats as they are not textual. Microsoft Word and
>     OpenOffice do contain features allowing a user to
>     perform merges independently of svn within the tool,
>     it's just that they'd have to do this "by hand" for
>     every merge conflict.
> ===================================================
>
> // Ben
>
Just one question about merging with ODT-like documents: not being able
to merge would leave things no worse than binary docs, right? By this I
mean that SVN does not try to merge changes in binaries currently. So
one could still get advantages in space by doing some sort of
uncompress-diff process as I previously suggested, but one would just be
left without the possibility of merging. Is this correct?
Also, why is html better at merging than flat ODT? I would imagine that
any XML-like format would have problems with blind merging.
If anyone has thoughts on implementing this super-compressed text
diffing with appropriate hooks somehow, I would love to hear them. It
doesn't sound that difficult to do the uncompress, tar, diff process,
but doing it right is probably out of my league.
I also wonder whether Bazaar or some other version control system has
tried to tackle this problem.
Cheers,
David
Re: improving subversion treatment of compressed XML/text file formats
Posted by Benjamin Smith-Mannschott <bs...@gmail.com>.
On Oct 24, 2008, at 16:21, David Kaplan wrote:
> Hi,
>
> On Wed, 2008-10-22 at 17:45 -0500, Ryan Schmidt wrote:
>
>> Subversion stores all files in the repository as differences against
>> previous versions. It does not differentiate between text or binary
>> files at this point. However, depending on the compression algorithm,
>> compressed files don't necessarily lend themselves to efficient
>> diffing, which can result in them taking more space in the repository
>> over time than the uncompressed versions would have.
>>
>
> I wasn't sure about this point, but my experience is that small
> changes
> to a document seem to produce large diffs in the compressed version
> leading to a large repository.
I can confirm this. We'll be using an svn repository for
documentation in the future. One question that arose was which
tool/format to use for our textual documents. (I pushed for plain text,
either markdown or reStructuredText, because it diffs and merges nicely,
but the usability story just isn't there for most of those who'll
actually be writing the documentation.)
I ran some tests simulating a few thousand edits and commits using a
few different formats. Traditional doc files are pretty well behaved
WRT repository space usage. ODF files stink because every edit, no
matter how minor, ends up storing the whole document in the repository
again. I discovered FODT (flat ODT), which merges all the parts of a
normal ODT file into a single XML file (images and other binary things
are base64-encoded). This sounds ludicrous, but it's quite svn-friendly.
Unfortunately, the flat variants of the OpenOffice.org file formats
only seem to be supported by the OO.o 2.4 included with Ubuntu. I've
not found much mention of them online and I've not found them supported
under Windows or Mac OS. We finally settled on using OO.o's HTML
support as our "standard" format for textual documentation, knowing
that we could "upgrade" to ODT should we require its additional
features.
SVN Repo Space Efficiency when Edited often:

format         space efficiency  merge-friendliness
=============  ================  ==================
plain text     very good         very good
html           very good         good
flat ODT       good              poor [1]
msword doc     acceptable        impossible [2]
msword docx    poor              impossible [2]
ODT            poor              impossible [2]
---------------------------------------------------
[1] This format isn't widely supported (a pity, really).
[2] SVN will not and should not attempt to merge these
    formats as they are not textual. Microsoft Word and
    OpenOffice do contain features allowing a user to
    perform merges independently of svn within the tool,
    it's just that they'd have to do this "by hand" for
    every merge conflict.
===================================================
// Ben