Posted to dev@openoffice.apache.org by Liang Weike <we...@cs2c.com.cn> on 2012/01/09 08:55:40 UTC

Resolving MS Word Binary File Format

Hi all,

I'm making an investigation of OpenOffice processing the documents of MS
Office's binary file formats.
I find that when OpenOffice reads and writes documents in MS Office's
binary file formats (.doc, .ppt, .xls), it takes a long time and loses
some formatting, which slightly distorts the layout.
We know that MS has announced that its Office binary file formats are
open and has provided detailed documentation. But that happened after
OpenOffice could already open and save them.
So, has OpenOffice improved the flow and construction of resolving MS
Office's binary file formats after MS offered the specification? And now
is there anyone who is working on or interested in it?

-- 
Regards,
Liang Weike


Re: Resolving MS Word Binary File Format

Posted by Dave Fisher <da...@comcast.net>.
On Jan 17, 2012, at 2:50 PM, Andrea Pescetti wrote:

> On 09/01/2012 Liang Weike wrote:
>> I'm making an investigation of OpenOffice processing the documents of MS
>> Office's binary file formats. ...
>> So, has OpenOffice improved the flow and construction of resolving MS
>> Office's binary file formats after MS offered the specification?
> 
> As far as I know, no substantial rewriting of the filters for
> doc/xls/ppt files happened after Microsoft released the specification:
> I've seen several incremental improvements over the years, but never a
> complete rewrite.
> 
> I don't actually know if Microsoft released the specification in a form
> that would make it easy to write an import filter from scratch.

If you want to use Java, you might take a look at Apache POI. It will at least give you ideas, and if AOO is further along with something, synergy flowing back to POI would be great as well.

Regards,
Dave


> 
> Regards,
>  Andrea.


RE: Resolving MS Word Binary File Format

Posted by "Dennis E. Hamilton" <de...@acm.org>.
I just downloaded the 70MB Zip of the complete set from the first link on this page: <http://msdn.microsoft.com/en-us/library/cc313118.aspx>.

The [MS-DOC].pdf is over 600 pages.  So it is not a treat to implement from scratch, especially with figuring out how to map into/from the OpenOffice.org model.  Having the code of an existing converter would provide something to gut for structure and maybe even to morph rather than do from scratch.  (There are also related documents that need to be consulted for specialized aspects that are common across the Microsoft Office programs.)

Based on the quality that I found in the RTF specification, I suspect there is more than enough to base an implementation on, but that is a superficial appraisal of this document.

There is also code that works with these formats (e.g., Apache POI), and there may be other converters that can be consulted.  I thought there was a relevant SourceForge project, but POI may be more current and active.

Consultation of the specifications for OOXML might also be helpful, since there is considerable semantic harmony between those and the binaries, at least to a point.

Any implementation has to be test-driven, and in particular heavily tested with documents going into and out of the Microsoft Office products.
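
As a rough illustration of the first step in building such a test corpus, here is a minimal Python sketch that sorts sample files by container type before they are fed to a filter. It relies on two facts that are not stated in this thread: the legacy .doc/.xls/.ppt files are stored in the OLE2 compound file container, which begins with a fixed 8-byte signature, and OOXML packages are Zip files. The names CFB_SIGNATURE and classify are hypothetical, chosen only for this example, and the check says nothing about whether the content inside the container is valid.

```python
import io

# Assumption for this sketch: the 8-byte signature that opens an OLE2
# compound file, the container used by legacy .doc, .xls and .ppt files.
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")

# Local file header signature at the start of a typical Zip archive
# (OOXML packages such as .docx are Zip files).
ZIP_SIGNATURE = b"PK\x03\x04"


def classify(stream):
    """Coarse triage of a test document: 'compound', 'zip', or 'unknown'.

    Only the first bytes are inspected; this does not validate the file.
    """
    head = stream.read(len(CFB_SIGNATURE))
    if head == CFB_SIGNATURE:
        return "compound"
    if head.startswith(ZIP_SIGNATURE):
        return "zip"
    return "unknown"


if __name__ == "__main__":
    # In-memory stand-ins for real sample documents.
    samples = {
        "legacy": io.BytesIO(CFB_SIGNATURE + b"\x00" * 504),
        "ooxml": io.BytesIO(ZIP_SIGNATURE + b"\x14\x00\x06\x00"),
        "text": io.BytesIO(b"plain text"),
    }
    for name, stream in samples.items():
        print(name, classify(stream))
```

With real files, the same function can be called on a file opened in binary mode, e.g. open(path, "rb").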

 - Dennis

-----Original Message-----
From: Andrea Pescetti [mailto:pescetti@apache.org] 
Sent: Tuesday, January 17, 2012 14:50
To: ooo-dev@incubator.apache.org
Subject: Re: Resolving MS Word Binary File Format

On 09/01/2012 Liang Weike wrote:
> I'm making an investigation of OpenOffice processing the documents of MS
> Office's binary file formats. ...
> So, has OpenOffice improved the flow and construction of resolving MS
> Office's binary file formats after MS offered the specification?

As far as I know, no substantial rewriting of the filters for
doc/xls/ppt files happened after Microsoft released the specification:
I've seen several incremental improvements over the years, but never a
complete rewrite.

I don't actually know if Microsoft released the specification in a form
that would make it easy to write an import filter from scratch.

Regards,
  Andrea.


Re: Resolving MS Word Binary File Format

Posted by Andrea Pescetti <pe...@apache.org>.
On 09/01/2012 Liang Weike wrote:
> I'm making an investigation of OpenOffice processing the documents of MS
> Office's binary file formats. ...
> So, has OpenOffice improved the flow and construction of resolving MS
> Office's binary file formats after MS offered the specification?

As far as I know, no substantial rewriting of the filters for
doc/xls/ppt files happened after Microsoft released the specification:
I've seen several incremental improvements over the years, but never a
complete rewrite.

I don't actually know if Microsoft released the specification in a form
that would make it easy to write an import filter from scratch.

Regards,
  Andrea.