Posted to dev@subversion.apache.org by "Jonathan S. Shapiro" <sh...@eros-os.org> on 2002/08/14 00:43:42 UTC

Re: [opencm-dev] Re: [Prcs-list] [Prcs-cvs-list] Diff/Comparison of file formats others than ASCII/source code?

On Thu, 2002-08-08 at 23:08, Donovan Baarda wrote:

> Any merge/diff3 operation must be file-format aware to get it right. Text is
> a nice common denominator that the existing unix diff3 can handle. Something
> like a HTML/XML/RTF aware diff3 could be relatively easily implemented as a
> postprocessing stage to merges produced by the standard diff3 because they
> use text as their underlying format.

True, but not helpful. Most tools that export HTML export it without
newlines, making line-based diff essentially useless.

IBM has a decent DOM diff program for XML that could be readily adapted
to HTML. I think that I'd be inclined to go that way rather than try to
hack up diff for this purpose.
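
To make the newline problem concrete, here is a minimal Python sketch. It
is an illustration only and is not IBM's DOM diff: the NodeLister class and
the sample markup are hypothetical, and flattening the document into a
token list is a simplification of a real tree comparison. It contrasts a
line-based diff of newline-free HTML with a diff over parsed nodes.

```python
import difflib
from html.parser import HTMLParser


class NodeLister(HTMLParser):
    """Flatten HTML into one token per tag or text run (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        if data.strip():
            self.tokens.append(data.strip())


def tokens(html):
    parser = NodeLister()
    parser.feed(html)
    parser.close()
    return parser.tokens


def changed(old, new):
    """Return only the entries that the diff marks as removed or added."""
    return [d for d in difflib.ndiff(old, new) if d.startswith(("- ", "+ "))]


# Exported HTML with no newlines: the whole document is a single line.
old = "<html><body><h1>Title</h1><p>First draft.</p></body></html>"
new = "<html><body><h1>Title</h1><p>Second draft.</p></body></html>"

# Line-based view: the one and only line is reported as changed.
line_changes = changed(old.splitlines(), new.splitlines())

# Node-based view: only the text node that was edited is reported.
node_changes = changed(tokens(old), tokens(new))
```

The line-based result contains the entire document twice (once removed,
once added), while the node-based result isolates the edited text run.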


shap


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [A few SCM lists] Diff/Comparison of file formats others than ASCII/source code?

Posted by Alessandro Bottoni <al...@libero.it>.
On Wednesday 14 August 2002 02:43, Jonathan S. Shapiro wrote:
>On Thu, 2002-08-08 at 23:08, Donovan Baarda wrote:
>> Any merge/diff3 operation must be file-format aware to get it right. Text
>> is a nice common denominator that the existing unix diff3 can handle.
>> Something like a HTML/XML/RTF aware diff3 could be relatively easily
>> implemented as a postprocessing stage to merges produced by the standard
>> diff3 because they use text as their underlying format.
>
>True, but not helpful. Most tools that export HTML export it without
>newlines, making line-based diff essentially useless.
>
>IBM has a decent DOM diff program for XML that could be readily adapted
>to HTML. I think that I'd be inclined to go that way rather than try to
>hack up diff for this purpose.
>

Let me point out a strange situation...

Most, if not all, of the RCS/SCM tools that I'm reviewing (wisely) delegate
the diff/merge/comparison of file contents to an external program so, in
principle, most of them can be adapted to any file format, from
XML/HTML to RTF to 3D CAD models and so on.
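
As a rough Python sketch of that delegation idea (the HANDLERS table, the
handler names, and the file extensions are all hypothetical and not taken
from any particular SCM), a tool can pick a format-specific diff routine by
extension and fall back to a generic answer for unknown formats:

```python
import difflib
import os


def text_diff(old, new):
    """Line-based diff for formats that are plain text underneath."""
    old_lines = old.decode("utf-8").splitlines()
    new_lines = new.decode("utf-8").splitlines()
    return list(difflib.unified_diff(old_lines, new_lines, lineterm=""))


def binary_diff(old, new):
    """Fallback: can only say whether the contents differ."""
    return ["binary files differ"] if old != new else []


# Hypothetical registry: extension -> diff handler. A format-aware tool
# (CAD, RTF, ...) would be plugged in here without touching the SCM core.
HANDLERS = {".txt": text_diff, ".html": text_diff}


def diff_file(name, old, new):
    """Dispatch to the handler registered for the file's extension."""
    ext = os.path.splitext(name)[1].lower()
    handler = HANDLERS.get(ext, binary_diff)
    return handler(old, new)


text_report = diff_file("notes.txt", b"a\nb\n", b"a\nc\n")
cad_report = diff_file("model.dwg", b"\x00\x01", b"\x00\x02")
```

Registering a new entry in the table is all it takes to support another
format, which is why the shortage of such handlers is the real bottleneck.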

Strangely enough, there are very few such "file-format-specific"
diff/merge tools around. This is strange because it is clear that such tools
could have a huge market. Just think of how many companies have large
repositories of CAD drawings, RTF (or, worse, MS Word) documents and HTML
files (that is, web sites). An RCS tool that was able to manage such file
formats would be of great help to a lot of people.

I hope that some developer on the list will think about this market
niche (even if as a commercial, not open source, product).

Re: [opencm-dev] Re: [A few SCM lists] Diff/Comparison of file formats others than ASCII/source code?

Posted by Noel Yap <ya...@yahoo.com>.
--- David Brown <op...@davidb.org> wrote:
> Other file formats cause the issue to become significantly more
> complicated.  Take MS-word for example.  Word itself will gladly show
> you document differences between two documents where small changes have
> been made to the contents of text.  But what if one document has had
> significant formatting changes (change the paragraph style).  How would
> you do a merge if one branch changed the formatting, and the other
> branch changed the text, or maybe a different aspect of the formatting.

This would really depend on how the info is stored.
It sounds like there's a viable algorithm for
tree-based representations, so if Word stored its data
in trees (isn't MS moving towards XML
representations?), one would exist for Word.  Doesn't
ClearCase have a diffing algorithm for Word and other
MS docs?

> CAD drawings are even more difficult.  It may not be difficult to
> determine what has changed, but how do you represent that change.  A
> diff tool is plausible (show both drawings, in different colors, for
> example), but a merge is even more difficult.

I partially agree.  The real question is, "What does
'merge' mean for images?"

Technically, if a diffing algorithm exists, a merge
algorithm can also exist (e.g., do something similar to
diff3).  The difficulty is whether the merge algorithm
conforms to the human interpretation of merging for
the particular item.
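
A minimal Python sketch of the diff3-style idea, under the strong
assumption that a document can be reduced to a flat set of named
properties (the "text" and "style" keys below are hypothetical, and real
Word documents are not this simple). Each property is compared against the
common ancestor, and only properties changed differently on both sides are
flagged as conflicts:

```python
MISSING = object()  # sentinel for "property not present in this version"


def merge3(base, ours, theirs):
    """Three-way merge of flat property dicts; returns (merged, conflicts)."""
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b = base.get(key, MISSING)
        o = ours.get(key, MISSING)
        t = theirs.get(key, MISSING)
        if o == t:
            pick = o          # both sides agree (or neither changed it)
        elif o == b:
            pick = t          # only their side changed it
        elif t == b:
            pick = o          # only our side changed it
        else:
            conflicts.append(key)
            pick = o          # arbitrary choice: keep ours, but flag it
        if pick is not MISSING:
            merged[key] = pick
    return merged, conflicts


base = {"text": "Hello world", "style": "Normal"}
ours = {"text": "Hello world", "style": "Heading 1"}    # formatting change
theirs = {"text": "Hello there", "style": "Normal"}     # text change

merged, conflicts = merge3(base, ours, theirs)

# Both branches restyle the same paragraph differently: a real conflict.
_, style_conflicts = merge3(base, ours, {"text": "Hello world", "style": "Quote"})
```

The mechanics are easy; whether keeping both the new style and the new
text matches what a human would call the right merge is the open question.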

MTC,
Noel


Re: [A few SCM lists] Diff/Comparison of file formats others than ASCII/source code?

Posted by Michael Poole <po...@troilus.org>.
Alan Langford <ja...@ambitonline.com> writes:

> I've been pondering this and it's even more fun than that. A good
> number of Word documents contain drawn graphics,  embedded images,
> embedded spreadsheets, embedded Visio diagrams... you name it. Any
> diff utility that understands Word needs to be able to accommodate
> plug-ins that understand most major types of embedded object. It seems
> that implementing anything close to a comprehensive Word diff is
> prohibitively complex.

That is a poor reason to not try it.  It is much more useful to say
"This sub-object is being treated as an opaque blob, and differs
between these revisions" than to say "We cannot handle some cases of
this file type, so we won't try at all."  You can have the same
problem in any compound document.  Although I don't use Word much, the
number of cases for which it *will* work seems to justify the effort.
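
A small Python sketch of the opaque-blob approach, assuming a compound
document has already been split into named parts (the part names and the
str-versus-bytes convention are hypothetical, not a description of the
Word format). Parts the tool understands get a real diff; everything else
is only reported as "differs":

```python
import difflib
import hashlib


def short_digest(blob):
    """Return a short fingerprint for a part, or 'absent' if it is missing."""
    if blob is None:
        return "absent"
    if isinstance(blob, str):
        blob = blob.encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:8]


def diff_compound(old_parts, new_parts):
    """Diff text parts line by line; treat all other parts as opaque blobs."""
    report = {}
    for name in sorted(set(old_parts) | set(new_parts)):
        old, new = old_parts.get(name), new_parts.get(name)
        if old == new:
            continue
        if isinstance(old, str) and isinstance(new, str):
            delta = difflib.ndiff(old.splitlines(), new.splitlines())
            report[name] = [d for d in delta if d.startswith(("- ", "+ "))]
        else:
            # Not understood: say that it changed, without saying how.
            report[name] = (
                f"opaque blob differs ({short_digest(old)} -> {short_digest(new)})"
            )
    return report


old_doc = {"body": "Intro\nOld claim", "chart.bin": b"\x89PNG-v1", "footer": "Page 1"}
new_doc = {"body": "Intro\nNew claim", "chart.bin": b"\x89PNG-v2", "footer": "Page 1"}

report = diff_compound(old_doc, new_doc)
```

A plug-in for a given embedded type would simply replace the opaque branch
for that part, so coverage can grow one object type at a time.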

Michael Poole


Re: [opencm-dev] Re: [A few SCM lists] Diff/Comparison of file formats others than ASCII/source code?

Posted by Alan Langford <ja...@ambitonline.com>.
At 2002/08/14 09:20 -0700, David Brown wrote:
>Other file formats cause the issue to become significantly more
>complicated.  Take MS-word for example.  Word itself will gladly show
>you document differences between two documents where small changes have
>been made to the contents of text.  But what if one document has had
>significant formatting changes (change the paragraph style).  How would
>you do a merge if one branch changed the formatting, and the other
>branch changed the text, or maybe a different aspect of the formatting.

I've been pondering this and it's even more fun than that. A good number of 
Word documents contain drawn graphics,  embedded images, embedded 
spreadsheets, embedded Visio diagrams... you name it. Any diff utility that 
understands Word needs to be able to accommodate plug-ins that understand 
most major types of embedded object. It seems that implementing anything 
close to a comprehensive Word diff is prohibitively complex.



Re: [opencm-dev] Re: [A few SCM lists] Diff/Comparison of file formats others than ASCII/source code?

Posted by David Brown <op...@davidb.org>.
On Wed, Aug 14, 2002 at 09:07:53AM +0200, Alessandro Bottoni wrote:

> Strangely enough, there are very few such "file-format-specific"
> diff/merge tools around. This is strange because it is clear that such tools
> could have a huge market. Just think of how many companies have large
> repositories of CAD drawings, RTF (or, worse, MS Word) documents and HTML
> files (that is, web sites). An RCS tool that was able to manage such file
> formats would be of great help to a lot of people.
>
> I hope that some developer on the list will think about this market
> niche (even if as a commercial, not open source, product).

Part of the reason may just be the definition of "difference".  For text
files, it is simple to define a difference.  Break the file into lines,
and report which lines are different.  Most of the additional features
of modern text diff tools have to do with how to present the information.
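
In Python, that definition fits in a few lines using the standard difflib
module (the line_diff name and the sample text are only for illustration):

```python
import difflib


def line_diff(old_text, new_text):
    """Break both texts into lines and report the runs that differ."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    report = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue  # unchanged runs are not part of the difference
        report.append((op, old_lines[i1:i2], new_lines[j1:j2]))
    return report


changes = line_diff("alpha\nbeta\ngamma\n", "alpha\nBETA\ngamma\ndelta\n")
```

Everything beyond this (context lines, side-by-side views, colors) is
presentation layered on top of the same list of changed line runs.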

Other file formats cause the issue to become significantly more
complicated.  Take MS-word for example.  Word itself will gladly show
you document differences between two documents where small changes have
been made to the contents of text.  But what if one document has had
significant formatting changes (change the paragraph style)?  How would
you do a merge if one branch changed the formatting, and the other
branch changed the text, or maybe a different aspect of the formatting?

CAD drawings are even more difficult.  It may not be difficult to
determine what has changed, but how do you represent that change?  A
diff tool is plausible (show both drawings, in different colors, for
example), but a merge is even more difficult.

Dave Brown


Re: [opencm-dev] Re: [Prcs-list] [Prcs-cvs-list] Diff/Comparison of file formats others than ASCII/source code?

Posted by "Jonathan S. Shapiro" <sh...@eros-os.org>.
On Wed, 2002-08-14 at 00:45, Daniel Berlin wrote:
> On Tue, 13 Aug 2002, Jonathan S. Shapiro wrote:
> There also exists an xmldiff written in Python that uses the tree-to-tree
> correction algorithm described in "Change detection in hierarchically
> structured information" by S. Chawathe, A. Rajaraman, H. Garcia-Molina and
> J. Widom.
> 
> http://www.logilab.org/xmldiff/
>  

That is very cool, as the strategy works for any tree-structured
information. It could be adapted for RTF, for example.
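
To show why a tree-based strategy is format-independent, here is a
deliberately naive Python sketch. It is NOT the algorithm from the paper
and not how xmldiff works: it compares children position by position and
does no node matching or move detection. The Node type and the labels are
hypothetical. The point is only that once XML, HTML or RTF has been parsed
into such nodes, the same comparison code applies.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Generic tree node: any parsed format can be mapped onto this."""
    label: str
    value: str = ""
    children: list = field(default_factory=list)


def tree_diff(old, new, path=""):
    """Compare two trees position by position and list edit operations."""
    here = f"{path}/{old.label}"
    if old.label != new.label:
        # Different kinds of node: treat as a wholesale replacement.
        return [("replace", here, f"{path}/{new.label}")]
    ops = []
    if old.value != new.value:
        ops.append(("update", here, old.value, new.value))
    shared = min(len(old.children), len(new.children))
    for i in range(shared):
        ops += tree_diff(old.children[i], new.children[i], f"{here}[{i}]")
    for i, extra in enumerate(old.children[shared:], start=shared):
        ops.append(("delete", f"{here}[{i}]/{extra.label}"))
    for i, extra in enumerate(new.children[shared:], start=shared):
        ops.append(("insert", f"{here}[{i}]/{extra.label}"))
    return ops


old_tree = Node("doc", children=[Node("para", "Hello"), Node("para", "World")])
new_tree = Node(
    "doc",
    children=[Node("para", "Hello"), Node("para", "World!"), Node("table")],
)

ops = tree_diff(old_tree, new_tree)
```

A positional comparison like this produces noisy results when a node is
inserted in the middle of a sibling list, which is exactly the case a
proper tree-to-tree correction algorithm is designed to handle.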



Re: [opencm-dev] Re: [Prcs-list] [Prcs-cvs-list] Diff/Comparison of file formats others than ASCII/source code?

Posted by Daniel Berlin <db...@dberlin.org>.
On Tue, 13 Aug 2002, Jonathan S. Shapiro wrote:

> On Thu, 2002-08-08 at 23:08, Donovan Baarda wrote:
> 
> > Any merge/diff3 operation must be file-format aware to get it right. Text is
> > a nice common denominator that the existing unix diff3 can handle. Something
> > like a HTML/XML/RTF aware diff3 could be relatively easily implemented as a
> > postprocessing stage to merges produced by the standard diff3 because they
> > use text as their underlying format.
> 
> True, but not helpful. Most tools that export HTML export it without
> newlines, making line-based diff essentially useless.
> 
> IBM has a decent DOM diff program for XML that could be readily adapted
> to HTML. I think that I'd be inclined to go that way rather than try to
> hack up diff for this purpose.

There also exists an xmldiff written in Python that uses the tree-to-tree
correction algorithm described in "Change detection in hierarchically
structured information" by S. Chawathe, A. Rajaraman, H. Garcia-Molina and
J. Widom.

http://www.logilab.org/xmldiff/
 