Posted to dev@openoffice.apache.org by Peter Kelly <ke...@gmail.com> on 2014/08/15 08:56:54 UTC

DocFormats - Open source OOXML implementation

Those of you interested in OOXML may want to have a look at my own implementation of (a subset of) the spec, which is part of a library I've just made available as open source (license is ASLv2):

https://github.com/uxproductivity/DocFormats

I started working on this around two years ago as part of UX Write, and it's been included in the version shipping on the iOS app store since February 2013. I've recently finished removing all dependencies on iOS/OS X APIs, and converting all the code from Objective C to plain C99. It now also builds on Linux, with Windows not being too far away.

The design is based on bidirectional transformation, as a way of achieving non-destructive editing of foreign file formats. This permits incremental implementation of a given spec without risking data loss due to incomplete features, since unsupported features of a given file format are left untouched on save. UX Write uses HTML as both its native file format and in-memory data model (via WebKit), but relies on DocFormats to read & write .docx files, as well as export to LaTeX. The next major task I plan to work on (hopefully with help from others!) is .odt support.
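As a rough illustration of the bidirectional idea (a toy model only, not the DocFormats API; the names `get`, `put` and `SUPPORTED` are made up for this sketch), the source document can be thought of as a list of items, only some of which the converter understands. Reading projects out the supported items; saving writes edits back and copies everything else through unchanged:

```python
# Illustrative only: a toy "get/put" pair, not the DocFormats API.
# The source document is modelled as a list of (kind, payload) items.
# Only "para" items are understood; everything else is opaque.

SUPPORTED = {"para"}

def get(source):
    """Project the source into an editable view: supported items only,
    each tagged with its index in the source so it can be found again."""
    return [(i, payload) for i, (kind, payload) in enumerate(source)
            if kind in SUPPORTED]

def put(source, view):
    """Write an edited view back into the source. Items the view never
    exposed (unsupported kinds) are copied through unchanged."""
    edited = dict(view)
    result = []
    for i, (kind, payload) in enumerate(source):
        if kind not in SUPPORTED:
            result.append((kind, payload))      # untouched on save
        elif i in edited:
            result.append((kind, edited[i]))    # updated from the view
        # supported items missing from the view were deleted by the editor
    return result

source = [("para", "Hello"), ("chart", b"\x00opaque"), ("para", "World")]
view = get(source)
view[0] = (view[0][0], "Hello, list")   # edit the first paragraph
saved = put(source, view)
```

Here the "chart" item survives the round trip even though nothing in the view knows what it is. Inserting new items is left out of the sketch to keep it short.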

Now that this is open source, the eventual goal is for it to be generally usable by any app which has a need to support multiple file formats, such as OOXML and ODF. Currently it is limited to word processing formats only, but I'm interested in expanding it to cover spreadsheets, presentations, and drawings. Aside from editors, it also could be used for batch conversion tools, document analysis, web publishing, and other purposes.

There are minimal dependencies (basically only libxml and zlib), to make it easy to integrate into different apps. I'm not a fan of huge monolithic architectures, and have kept it very independent of other aspects of UX Write for this very purpose. Note that this means there is no editing or rendering code; it deals solely with conversion. UX Write uses WebKit for the rendering, but there are many other ways in which one could build on top of this.

I'll be presenting on this at ApacheCon EU this November - see the talk "Addressing File Format Compatibility in Word Processors" at http://apacheconeu2014.sched.org.

Comments/questions are welcome.

--
Dr. Peter M. Kelly
Founder, UX Productivity
peter@uxproductivity.com
http://www.uxproductivity.com/
http://www.kellypmk.net/

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)

Re: DocFormats - Open source OOXML implementation

Posted by Andrea Pescetti <pe...@apache.org>.

On 16/08/2014 Peter Kelly wrote:
> On 16 Aug 2014, at 12:55 pm, Andrea Pescetti wrote:
>> I've also been fixing (or breaking, who knows!) some documentation on
>> my clone (my "fork" as Github likes to call it) but I'll submit a pull
>> request only when basic things work.
> I've just merged in your changes and also invited you as a committer

Thanks. Note (this is just for information, I have absolutely nothing 
against it!) that Apache projects using Github as primary source have a 
policy of not integrating code without a pull request. So one needs to 
"fork" (in the Github sense of course, so not a "fork" in its common 
meaning) the project and create a pull request. This is necessary 
because Apache prefers (and at times requires) that all patches being 
integrated are not only under the right license, but also voluntarily 
contributed. "Apache Way" class finished, sorry for being boring and 
let's move on...

> Then you'll be able to push directly to it instead of
> having to maintain your own fork.

Perfect. Of course, it was a fork in the Github meaning rather than the 
common meaning, so I never meant to maintain a separate version, I just 
wanted to produce pull requests.

> I vote that we establish a policy of rebasing instead of merging in the
> general case (unless there's a good reason to do otherwise), as this
> will help maintain a mostly-linear history

No strong preferences for me. But I won't commit anything to the 
repository until I get my account properly configured, since it is from 
my work account and I can't afford to mix them (so, if a couple of commits 
with the wrong e-mail address have already sneaked in, this is already bad, 
but I'll now set up my accounts and environments properly before doing 
any other activity).

Anyway, I have nothing to commit at the moment in terms of code.

Regards,
   Andrea.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@openoffice.apache.org
For additional commands, e-mail: dev-help@openoffice.apache.org

Re: DocFormats - Open source OOXML implementation

Posted by Peter Kelly <ke...@gmail.com>.

On 16 Aug 2014, at 12:55 pm, Andrea Pescetti <pe...@apache.org> wrote:

> I've also been fixing (or breaking, who knows!) some documentation on my clone (my "fork" as Github likes to call it) but I'll submit a pull request only when basic things work.

I've just merged in your changes and also invited you as a committer to the repository. Then you'll be able to push directly to it instead of having to maintain your own fork.

I vote that we establish a policy of rebasing instead of merging in the general case (unless there's a good reason to do otherwise), as this will help maintain a mostly-linear history and avoid the annoyances described in [1]. That is, if before I push to the repository I see that the remote master has advanced (due to you or Jan committing something else), I'll rebase my commits on top of yours, so they look like they come "after" them in the history. Likewise, you and Jan would do the same if I've made commits. What do you think?

[1] http://blog.spreedly.com/2014/06/24/merge-pull-request-considered-harmful

Re: DocFormats - Open source OOXML implementation

Posted by Andrea Pescetti <pe...@apache.org>.

Peter Kelly wrote:
> On 16 Aug 2014, at 5:26 am, Andrea Pescetti wrote:
>> Does this mean that
>> $ dfutil/dfutil filename.docx filename.html
>> $ dfutil/dfutil filename.html filename2.docx
>> should produce a "filename2.docx" that is quite similar to
>> "filename.docx"? It is failing rather badly (invalid OOXML output in
>> the second conversion, ZIP container clearly missing files and
>> possible breaking order) in a simple test I did with a 1-page docx file.
>
> I'm not surprised this is the first issue to come up :$ There's a *lot*
> of knowledge I need to document for others; questions from you and
> others are the best way to motivate me to get that written ;)

I've also been fixing (or breaking, who knows!) some documentation on my 
clone (my "fork" as Github likes to call it) but I'll submit a pull 
request only when basic things work.

> Since the
> filename.html you generated does, it tries to map these to elements in
> the docx file, failing badly.

OK, but the following fails equally badly (producing an invalid OOXML 
file, even though this time it looks more consistent in size and 
internal content with filename.docx):
$ dfutil/dfutil filename.docx filename.html
Created filename.html
$ dfutil/dfutil filename.html filename.docx

What is the best channel to report this issue and the 38 tests that are 
failing in my setup (provided they are all expected to pass)?

> - Include a hash of the .docx file (or relevant parts of it) in the HTML
> file, e.g. as a meta element or as part of the prefix on all id attributes

Seems a good idea. Perhaps having it as a meta element will be enough, 
unless it makes sense for some reason to link each attribute to a 
specific .docx file. Still, this won't solve the problem above.
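
A minimal sketch of the meta-element variant, assuming a SHA-256 digest over the raw .docx bytes and a made-up meta name (`source-docx-sha256`); neither is something DocFormats defines, and the function names are hypothetical:

```python
import hashlib
import re

META_NAME = "source-docx-sha256"  # hypothetical meta element name

def docx_digest(docx_bytes):
    """Hex digest of the .docx content the HTML was generated from."""
    return hashlib.sha256(docx_bytes).hexdigest()

def stamp(html, docx_bytes):
    """Record the digest of the source .docx in the HTML head."""
    meta = f'<meta name="{META_NAME}" content="{docx_digest(docx_bytes)}">'
    return html.replace("<head>", "<head>" + meta, 1)

def mapping_is_valid(html, docx_bytes):
    """True only if the HTML carries a digest matching the .docx being updated."""
    m = re.search(rf'<meta name="{META_NAME}" content="([0-9a-f]+)">', html)
    return m is not None and m.group(1) == docx_digest(docx_bytes)

html = stamp("<html><head></head><body></body></html>", b"original docx bytes")
```

When `mapping_is_valid` returns False (no digest, or a different .docx), the update would treat the HTML as a complete replacement of the content, as proposed elsewhere in this thread. Hashing the whole ZIP is the simplest choice but also the most brittle; hashing only the relevant parts (for example word/document.xml) may be more robust.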

Regards,
   Andrea.

RE: DocFormats - Open source OOXML implementation

Posted by "Dennis E. Hamilton" <de...@acm.org>.

OK, I get it.  There is cross-talk between this dev-openoffice list and general-incubator involving two messages there,

1. A general-incubator post from you, replying to a message from Peter Kelly about his DocFormats document-conversion project and bringing Peter's request to the attention of general-incubator, at
<http://mail-archives.apache.org/mod_mbox/incubator-general/201408.mbox/%3CCAK2iWdTS%2BKUWWZ%2BBOAnsNW4PiE37OLJA%3Dx%2B5az%3DAdAviiS_47A%40mail.gmail.com%3E>.

2. An observation from Andrea that is essentially good wishes.

I find it an interesting leap from DocFormats to OpenOffice for tablets and look forward to seeing the incubator proposal.

I am definitely interested in the "student proposal, to get a compliance sheet made
for products that offer OXML and/or odf" that you mention.

Interoperability in interchange among document formats is a driving issue for me.  I look forward to more about that.  There has been significant effort in this area, although it does not seem to have made much impact and is generally little-known.  The OASIS effort on ODF Interoperability and Conformance (OIC TC) folded its tent in November 2013.  (On that one, I am an unindicted co-conspirator.)

I will see what references I can dig up after I submit updated pre-conference versions of some papers due this weekend, <https://sites.google.com/site/dchanges14/program>.  Information about those interop/conversion efforts would also be good backup information for the DChanges 2014 workshop next month.

 - Dennis

PS: Roundtripping between OOXML and HTML is something that Microsoft put considerable effort into.  Some found the resulting HTML (pre-HTML5) rather nauseous, but it is remarkably presentation-preserving as far as it goes. It might be informative to look into how well AOO does the same between ODF and [X]HTML as a calibration.  One could also look at the Office Web Apps, that manifest OOXML documents via editable web-page interfaces as a descendant.  These seem to be tied to the way that some Phone and Tablet Microsoft Office applications are tied to cloud-stored documents.

Re: DocFormats - Open source OOXML implementation

Posted by jan i <ja...@apache.org>.

On 16 August 2014 18:38, Dennis E. Hamilton <de...@acm.org> wrote:

> I don't have any skin in this game.
>
> Yet I am baffled about where this work is going on and what Apache Project
> it relates to.  Is there an incubator proposal for Apache DocFormats on its
> way?
>
Yes there is a proposal on its way, look at general-incubator approx. the
last 3 days. Right now it is not decided who should sponsor this project.


>
> In particular, I would expect that some thought would be given to the ODF
> Toolkit and that incubator project, <
> http://incubator.apache.org/odftoolkit/>.
>
> Also, Apache POI would seem to have some relevance, especially the
> OpenXML4J component, <http://poi.apache.org/>.
>
The intention is clearly to at least have a close cooperation with these
projects. But DocFormats aims at a bit more (e.g. being OpenOffice on
tablets).

I am right now working on a student proposal, to get a compliance sheet made
for products that offer OXML and/or odf. Maybe that would be something you
would want to help out with.


> These are all Java based, as is Armin's current project in the AOO
> repository.  I haven't listed open-source projects outside the embrace of
> ASF.
>
> A single <orcnote> remark is in-line below (although this notation may
> derail defective HTML presentation of plaintext containing angle brackets).
>
> Re-subscribing to general-incubator now ...
>
> Oh, and congratulations on joining the IPMC, Jan.
>
thanks a lot.

rgds
jan i

Re: DocFormats - Open source OOXML implementation

Posted by Peter Kelly <ke...@gmail.com>.

ODF toolkit and Apache POI are both APIs to specific file formats. The key differences with DocFormats are

1. Support for multiple file formats (a limited range supported presently, but the intention is to expand to other formats)
2. Ability to "abstract over" a file format, in that the goal is to allow people to write apps without caring what format the data is physically stored in
3. Use of HTML as a common intermediate format during translation (though other formats can be manipulated if natively supported by an editor, then converted back to the source format)
4. Bi-directionality, i.e. the ability to do non-destructive updates when converting between formats
5. A building-block for creating HTML-based editors, viewers, and other applications (in particular, using WebKit or other browser engines)
6. No reliance on Java

If you want to do mobile, you can't use anything Java-based - that is, if you want to support iOS. No-one can use the two projects you mentioned if they want to build an iPhone or iPad app, which is one of several reasons ODF is absent from the mobile space.

Although in the past, Java has been a great choice for cross-platform applications, sadly this is no longer the case - hence C (the code was originally in Objective C but translated to C). It's also an extra dependency which can unnecessarily bloat requirements for an application, whereas this is very lightweight.

In addition to a library for dealing with file formats, the overall idea is much wider than that - to build applications on top of this, such as an editor, also within the context of the project. And also, promoting the idea of "file format independence", in the same way as most now see platform-independence as a good thing. We're looking to make it as flexible as possible, so that it can be adapted for mobile, desktop, and web. It's sort of a "clean start" in a sense, though not necessarily aiming to entirely replicate existing projects, but rather something new.

Both the ODF toolkit and Apache POI have useful work which will quite possibly be of use. In particular I think the latter may be helpful for supporting the older binary MS file formats, and we hope to collaborate with other Apache projects where relevant.


RE: DocFormats - Open source OOXML implementation

Posted by "Dennis E. Hamilton" <de...@acm.org>.

I don't have any skin in this game.

Yet I am baffled about where this work is going on and what Apache Project it relates to.  Is there an incubator proposal for Apache DocFormats on its way?

In particular, I would expect that some thought would be given to the ODF Toolkit and that incubator project, <http://incubator.apache.org/odftoolkit/>.

Also, Apache POI would seem to have some relevance, especially the OpenXML4J component, <http://poi.apache.org/>.

These are all Java based, as is Armin's current project in the AOO repository.  I haven't listed open-source projects outside the embrace of ASF.

A single <orcnote> remark is in-line below (although this notation may derail defective HTML presentation of plaintext containing angle brackets).

Re-subscribing to general-incubator now ... 

Oh, and congratulations on joining the IPMC, Jan.

 -- Dennis E. Hamilton
    dennis.hamilton@acm.org    +1-206-779-9430
    https://keybase.io/orcmid  PGP F96E 89FF D456 628A
    X.509 certs used and requested for signed e-mail

-----Original Message-----
From: jan i [mailto:jani@apache.org] 
Sent: Saturday, August 16, 2014 01:10
To: dev
Subject: Re: DocFormats - Open source OOXML implementation

On 16 August 2014 03:50, Peter Kelly <ke...@gmail.com> wrote:

[ ... ]
> Now, onto the fix:
>
> The library needs to have some way of checking that the HTML file being
> used as part of an update operation has a mapping (id attributes) that
> matches the docx file being updated (in the case of creating a new file, this
> is just an empty docx file). In the event that this is not the case, it
> could still do the update, but would act as if the entire document had been
> replaced with a completely new one.
>
> The solution I'll likely implement (and this should really be my first
> task, given the potential for problems like the above) is this:
>
In my humble opinion you should not use time on this right now.

If you fix a bug we have a 1-1 relation (1 man used, 1 bug fixed).
If you start getting the documentation right we have a 1-n relation (1 man
used, n men help fix bugs).

Please keep in mind, we build a community in order to move away from "I
have to do it, because I am the only one who knows how" and you are the most
important enabler of that... we need your knowledge in a file, so that
others can work.

[ ... ]

When the project (hopefully) enters incubator, we will automatically have
access to a bug tracking system (jira), and with that hopefully only being
some months away I would not recommend setting up one now.

<orcnote>
   On Github, there is already an issues structure, 
   <https://github.com/uxproductivity/DocFormats/issues>.
   I think this should be continued in use until a different 
   setup arrives "any day soon".  Note that some Github projects 
   create a single subrepository that is just for its issues 
   function.  E.g., https://github.com/keybase/keybase-issues
</orcnote>

[ ... ]

Re: DocFormats - Open source OOXML implementation

Posted by jan i <ja...@apache.org>.

On 16 August 2014 03:50, Peter Kelly <ke...@gmail.com> wrote:

> On 16 Aug 2014, at 5:26 am, Andrea Pescetti <pe...@apache.org> wrote:
>
> On 15/08/2014 Peter Kelly wrote:
>
> Those of you interested in OOXML may want to have a look at my own
> implementation of (a subset of) the spec, which is part of a library
> I've just made available as open source (license is ASLv2):
> https://github.com/uxproductivity/DocFormats
>
>
> It's very interesting. I hope that in future it may become relevant to
> OpenOffice or to Apache at large.
>
> The design is based on bidirectional transformation, as a way of
> achieving non-destructive editing of foreign file formats. This permits
> incremental implementation of a given spec without risking data loss due
> to incomplete features, since unsupported features of a given file
> format are left untouched on save.
>
>
> Does this mean that
> $ dfutil/dfutil filename.docx filename.html
> $ dfutil/dfutil filename.html filename2.docx
> should produce a "filename2.docx" that is quite similar to
> "filename.docx"? It is failing rather badly (invalid OOXML output in the
> second conversion, ZIP container clearly missing files and possible
> breaking order) in a simple test I did with a 1-page docx file.
>
>
> I'm not surprised this is the first issue to come up :$ There's a *lot* of
> knowledge I need to document for others; questions from you and others are
> the best way to motivate me to get that written ;)
>
> What's happening here is that when filename.html is produced in the first
> step, each of its elements contains an id attribute containing a numeric
> identifier that refers to a specific element in the source docx file
> (specifically, the word/document.xml file within the package). These
> numeric identifiers are generated during parsing, and correspond to the
> position of the element in document order (so 1, 2, 3, etc.). When you
> convert from HTML to .docx, it uses the id attributes to re-establish these
> relationships, so that it knows which elements in the HTML file correspond
> to which elements in the .docx file.
>
> The problem you encountered stems from the fact that this mapping is only
> valid in specific circumstances - that is, when the .docx file being
> updated is exactly the same as its original. If this is not the case, then
> the identifier assigned to a given node will be different whenever
> other nodes have been inserted before it. So for example if you do
> the following:
>
> dfutil filename.docx filename.html
> # Modify filename.html
> dfutil filename.html filename.docx
> dfutil filename.html filename.docx
>
> Then the third run will fail, because in the second the docx file will
> have been updated based on the changes in the HTML, changing the sequence
> numbers assigned to each node, so that on the third run the mapping is no
> longer valid. The conversion works on the assumption that the docx file is
> the same as the original. The way that UX Write uses the library, it ensures
> this is the case, but the library does not check for this (and yes, it
> should; more on this below).
>
> Your case is similar, though in this case you're creating a new docx file,
> not updating an existing one. However what it actually does in this case is
> to create an empty .docx file, and then "update" that based on the HTML. In
> doing so, it assumes that the HTML does not contain any mappings (that is,
> id attributes with the prefix "bdt"). Since the filename.html you generated
> does, it tries to map these to elements in the docx file, failing badly.
>
> The only workaround for this at present is to manually edit the HTML file
> and remove all id attributes. The quickest way to do this is with the
> following command:
>
> sed -i '' -E 's/ id="word[0-9]+"//g' filename.html
>
> Then, when you run dfutil, it will see that there is no mapping for any of
> the elements in the HTML file, and thus avoid the problems in the output
> you observed.
>
> Now, onto the fix:
>
> The library needs to have some way of checking that the HTML file being
> used as part of an update operation has a mapping (id attributes) that
> matches the docx file being updated (in the case of creating a new file, this
> is just an empty docx file). In the event that this is not the case, it
> could still do the update, but would act as if the entire document had been
> replaced with a completely new one.
>
> The solution I'll likely implement (and this should really be my first
> task, given the potential for problems like the above) is this:
>
In my humble opinion you should not use time on this right now.

If you fix a bug we have a 1-1 relation (1 man used, 1 bug fixed).
If you start getting the documentation right we have a 1-n relation (1 man
used, n men help fix bugs).

Please keep in mind, we build a community in order to move away from "I
have to do it, because I am the only one who knows how" and you are the most
important enabler of that... we need your knowledge in a file, so that
others can work.



>
> - Include a hash of the .docx file (or relevant parts of it) in the HTML
> file, e.g. as a meta element or as part of the prefix on all id attributes
> - On update, re-compute the hash of the .docx file and compare it
> against the one stored in the HTML file (if any), and if there's no match,
> treat the HTML file as a complete replacement of all content
>
>
>
> What is the best channel to report issues?
>
For now it's surely to mail Peter. But Peter, can you please make a bug
directory, and put emails as plain text in there, so we have a reference.
Ideally the mails should be numbered, and the fixes carry the same number in
the commit text.

When the project (hopefully) enters incubator, we will automatically have
access to a bug tracking system (jira), and with that hopefully only being
some months away I would not recommend setting up one now.

@andrea thanks a lot for your test, and a little bit of background: Peter
separated the closed source project and the new open source project, and it
seems it was done a bit hastily (we are all highly motivated to get this
going).

@andrea, patches will be most welcome. Due to a recommendation from jake f.
(infra) we have made the repo RO, but Peter or I will make sure your
patches go into the code base very quickly.

rgds
jan I

Re: DocFormats - Open source OOXML implementation

Posted by Peter Kelly <ke...@gmail.com>.

On 16 Aug 2014, at 5:26 am, Andrea Pescetti <pe...@apache.org> wrote:

> On 15/08/2014 Peter Kelly wrote:
>> Those of you interested in OOXML may want to have a look at my own
>> implementation of (a subset of) the spec, which is part of a library
>> I've just made available as open source (license is ASLv2):
>> https://github.com/uxproductivity/DocFormats
> 
> It's very interesting. I hope that in future it may become relevant to OpenOffice or to Apache at large.
> 
>> The design is based on bidirectional transformation, as a way of
>> achieving non-destructive editing of foreign file formats. This permits
>> incremental implementation of a given spec without risking data loss due
>> to incomplete features, since unsupported features of a given file
>> format are left untouched on save.
> 
> Does this mean that
> $ dfutil/dfutil filename.docx filename.html
> $ dfutil/dfutil filename.html filename2.docx
> should produce a "filename2.docx" that is quite similar to "filename.docx"? It is failing rather badly (invalid OOXML output in the second conversion, ZIP container clearly missing files and possible breaking order) in a simple test I did with a 1-page docx file.

I'm not surprised this is the first issue to come up :$ There's a *lot* of knowledge I need to document for others; questions from you and others are the best way to motivate me to get that written ;)

What's happening here is that when the filename.html is produced in the first step, each of its elements contains an id attribute containing a numeric identifier that refers to a specific element in the source docx file (specifically, the word/document.xml file within the package). These numeric identifiers are generated during parsing, and correspond to the position of the element in document order (so 1, 2, 3, etc.). When you convert from HTML to .docx, it uses the id attributes to re-establish these relationships, so that it knows which elements in the HTML file correspond to which elements in the .docx file.
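To make the numbering scheme concrete, here is a minimal sketch in Python. It is not the DocFormats code (the library itself is C99); the function name assign_ids and the sample XML are made up for illustration, and the "word" prefix is only assumed from the sed command further down. It just shows what "identifier = position in document order" means:

```python
import xml.etree.ElementTree as ET

def assign_ids(root, prefix="word"):
    """Map an id string to every element, numbered in document order.

    Hypothetical helper: illustrates positional numbering only, not the
    library's real data structures.
    """
    mapping = {}
    # Element.iter() walks the tree in document order, starting at root
    for seq, elem in enumerate(root.iter(), start=1):
        mapping[f"{prefix}{seq}"] = elem
    return mapping

doc = ET.fromstring("<body><p><r>Hello</r></p><p><r>World</r></p></body>")
ids = assign_ids(doc)
print({key: elem.tag for key, elem in ids.items()})
```

Because the identifier is purely positional, it carries no information about the element itself, which is why it only stays meaningful while the source document is unchanged.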

The problem you encountered stems from the fact that this mapping is only valid in specific circumstances - that is, when the .docx file being updated is exactly the same as the original from which the HTML was generated. If this is not the case, then the identifier assigned to a given node will be different whenever other nodes have been inserted before it. So for example if you do the following:

dfutil filename.docx filename.html
# Modify filename.html
dfutil filename.html filename.docx
dfutil filename.html filename.docx

Then the third command will fail: the second command updates the docx file based on the changes in the HTML, which changes the sequence numbers assigned to each node, so by the time the third command runs the mapping in the HTML is no longer valid. The conversion works on the assumption that the docx file is the same as the original. The way that UX Write uses the library, it ensures this is the case, but the library does not check for this (and yes, it should; more on this below).
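The effect of a stale mapping can be reproduced in a few lines. This is only an illustrative sketch in Python with a made-up helper (number_elements) and a toy document, not what dfutil does internally: it numbers the elements, inserts one paragraph, and numbers them again.

```python
import xml.etree.ElementTree as ET

def number_elements(root):
    """Return {sequence number: element text} in document order."""
    return {seq: (elem.text or "")
            for seq, elem in enumerate(root.iter(), start=1)}

original = ET.fromstring("<body><p>First</p><p>Second</p></body>")
before = number_elements(original)

# Simulate an update that inserts a new paragraph at the front
new_para = ET.Element("p")
new_para.text = "Inserted"
original.insert(0, new_para)
after = number_elements(original)

stale_id = 2  # recorded in the HTML while "First" was element number 2
print(before[stale_id], "->", after[stale_id])
```

The HTML still says "element 2", but after the update that number refers to a different paragraph, so any further update based on the old ids touches the wrong nodes.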

Your case is similar, though in this case you're creating a new docx file, not updating an existing one. However what it actually does in this case is to create an empty .docx file, and then "update" that based on the HTML. In doing so, it assumes that the HTML does not contain any mappings (that is, id attributes with the prefix "bdt"). Since the filename.html you generated does, it tries to map these to elements in the docx file, failing badly.

The only workaround for this at present is to manually edit the HTML file and remove all id attributes. The quickest way to do this is with the following command:

sed -i '' -E 's/ id="word[0-9]+"//g' filename.html

Then, when you run dfutil, it will see that there is no mapping for any of the elements in the HTML file, and thus avoid the problems in the output you observed.
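For anyone without a BSD-style sed (the -i '' form is what OS X expects), the same cleanup can be sketched in Python. The function name strip_mapping_ids is made up, and the id pattern is taken directly from the sed expression above; other id attributes are left alone.

```python
import re

# Same pattern as the sed expression: a space, then id="word<digits>"
ID_ATTR = re.compile(r' id="word[0-9]+"')

def strip_mapping_ids(html):
    """Remove every generated id attribute, leaving other ids intact."""
    return ID_ATTR.sub("", html)

sample = ('<p id="word12">Hello <span id="word13">world</span></p>'
          '<div id="toc">x</div>')
cleaned = strip_mapping_ids(sample)
print(cleaned)
```

Note that re.sub replaces every match, including several on the same line, which matters if the HTML is not written one element per line.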

Now, onto the fix:

The library needs to have some way of checking that the HTML file being used as part of an update operation has a mapping (id attributes) that matches the docx file being updated (in the case of creating a new file, this is just an empty docx file). In the event that this is not the case, it could still do the update, but would act as if the entire document had been replaced with a completely new one.

The solution I'll likely implement (and this should really be my first task, given the potential for problems like the above) is this:

- Include a hash of the .docx file (or relevant parts of it) in the HTML file, e.g. as a meta element or as part of the prefix on all id attributes
- On update, re-compute the hash of the .docx file and compare it against the one stored in the HTML file (if any), and if there's no match, treat the HTML file as a complete replacement of all content
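
A rough sketch of how such a check could look, written in Python rather than the library's C. Everything specific here is an assumption made for illustration: hashing only word/document.xml with SHA-256, the meta element name "source-hash", and the helper names. None of this exists in DocFormats at the moment.

```python
import hashlib
import io
import re
import zipfile

def docx_hash(docx_bytes):
    """Hash the main document part of a .docx package (assumed scope)."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return hashlib.sha256(package.read("word/document.xml")).hexdigest()

# Hypothetical meta element carrying the hash of the source .docx
META = re.compile(r'<meta name="source-hash" content="([0-9a-f]+)"')

def mapping_is_valid(html, docx_bytes):
    """True only if the HTML records a hash that matches this .docx."""
    match = META.search(html)
    return match is not None and match.group(1) == docx_hash(docx_bytes)

def make_docx(body_xml):
    """Build a tiny stand-in package holding only word/document.xml."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", body_xml)
    return buffer.getvalue()

original = make_docx("<w:document>one</w:document>")
modified = make_docx("<w:document>two</w:document>")
html = ('<html><head><meta name="source-hash" content="'
        + docx_hash(original) + '"></head></html>')

print(mapping_is_valid(html, original))   # hash matches: ids can be trusted
print(mapping_is_valid(html, modified))   # mismatch: treat as full replacement
```

When mapping_is_valid returns False (no stored hash, or a different one), the update would ignore the id attributes and treat the HTML as entirely new content.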

> 
> What is the best channel to report issues?

--
Dr. Peter M. Kelly
Founder, UX Productivity
peter@uxproductivity.com
http://www.uxproductivity.com/
http://www.kellypmk.net/

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)

Re: DocFormats - Open source OOXML implementation

Posted by Peter Kelly <ke...@gmail.com>.

On 16 Aug 2014, at 5:26 am, Andrea Pescetti <pe...@apache.org> wrote:

> Does this mean that
> $ dfutil/dfutil filename.docx filename.html
> $ dfutil/dfutil filename.html filename2.docx
> should produce a "filename2.docx" that is quite similar to "filename.docx"? It is failing rather badly (invalid OOXML output in the second conversion, ZIP container clearly missing files and possible breaking order) in a simple test I did with a 1-page docx file.
> 
> What is the best channel to report issues?

Currently just email to me (or here on the list), but we should ideally get a dedicated mailing list/bug tracking system set up for it soon.

--
Dr. Peter M. Kelly
Founder, UX Productivity
peter@uxproductivity.com
http://www.uxproductivity.com/
http://www.kellypmk.net/

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)

Re: DocFormats - Open source OOXML implementation

Posted by Andrea Pescetti <pe...@apache.org>.

On 15/08/2014 Peter Kelly wrote:
> Those of you interested in OOXML may want to have a look at my own
> implementation of (a subset of) the spec, which is part of a library
> I've just made available as open source (license is ASLv2):
> https://github.com/uxproductivity/DocFormats

It's very interesting. I hope that in future it may become relevant to 
OpenOffice or to Apache at large.

> The design is based on bidirectional transformation, as a way of
> achieving non-destructive editing of foreign file formats. This permits
> incremental implementation of a given spec without risking data loss due
> to incomplete features, since unsupported features of a given file
> format are left untouched on save.

Does this mean that
$ dfutil/dfutil filename.docx filename.html
$ dfutil/dfutil filename.html filename2.docx
should produce a "filename2.docx" that is quite similar to 
"filename.docx"? It is failing rather badly (invalid OOXML output in the 
second conversion, ZIP container clearly missing files and possible 
breaking order) in a simple test I did with a 1-page docx file.

What is the best channel to report issues?

> I'll be presenting on this at ApacheCon EU this November - see the talk
> "Addressing File Format Compatibility in Word Processors" at
> http://apacheconeu2014.sched.org

Looking forward to seeing it live!

Regards,
   Andrea.
