
Posted to dev@corinthia.apache.org by Peter Kelly <pm...@apache.org> on 2015/01/07 14:57:55 UTC

ODF filter

I mentioned in my last mail the topic of writing an ODF filter. I realise the codebase is pretty difficult to navigate right now due to lack of documentation, so I thought I’d get the discussion started by outlining how I would suggest we proceed with this, based on my experience writing the Word filter (I tend to use the term “Word” rather than OOXML, since the current implementation deals only with the word processing subset of the spec; similarly for ODF for now).

At a high level, each filter needs to provide three operations: get, put, and create. These operate on “abstract” and “concrete” documents - an abstract document is in HTML format (our common intermediate representation) and the concrete document is in the format which the filter is implementing (in this case, .odt).

The get operation will need to convert from ODT to HTML, and include id attributes in the HTML file that allow elements in the latter to be correlated with elements in the former. In the Word filter, the ids are based on the index of the node in a pre-order traversal of the tree. These are used to look up elements during the put operation, so we know which element to update.
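
To make the id scheme concrete, here is a minimal sketch in C of numbering nodes by their position in a pre-order traversal. It is not taken from the Word filter: the Node type, the function names, and the "odt" id prefix are all hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical minimal tree node; the real code uses its own DOM types. */
typedef struct Node {
    const char *name;
    struct Node *first_child;
    struct Node *next_sibling;
    int seq;                      /* pre-order index, assigned below */
} Node;

/* Number the node first, then its children from left to right.
   Returns the next unused index. */
static int assign_preorder_ids(Node *node, int next)
{
    assert(node != NULL);
    node->seq = next++;
    for (Node *c = node->first_child; c != NULL; c = c->next_sibling)
        next = assign_preorder_ids(c, next);
    return next;
}

/* Build the value of the HTML id attribute for a node.
   The "odt" prefix is an assumption made for this sketch. */
static void format_id(const Node *node, char *buf, size_t len)
{
    assert(node != NULL && buf != NULL);
    snprintf(buf, len, "odt%d", node->seq);
}
```

Because the numbering depends only on the shape of the concrete tree, running the same traversal again during put yields the same ids, provided the concrete document has not changed in between.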

The put operation will need to accept an existing ODT document, and update it based on a modified version of the HTML file that was previously obtained from the get operation. The way I did this in the Word filter was to traverse both trees in “parallel”, determining what had changed (using the element mappings based on id attributes) and making changes to the original document as appropriate. In the case of formatting attributes, this involved re-generating the CSS from the concrete document, comparing which attributes had changed, and then applying the necessary changes to the formatting elements in the concrete document. Content was handled differently, generally by simply overwriting it.
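
The comparison step for formatting can be sketched roughly as follows. This is only an illustration of the idea, assuming CSS properties are held as simple name/value pairs; the Prop type and the function names are hypothetical and do not correspond to the actual Word filter code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical representation of one CSS property. */
typedef struct { const char *name; const char *value; } Prop;

typedef enum { PROP_SAME, PROP_SET, PROP_REMOVED } PropChange;

/* Look up a property value by name; NULL if absent. */
static const char *prop_get(const Prop *props, size_t n, const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < n; i++)
        if (strcmp(props[i].name, name) == 0)
            return props[i].value;
    return NULL;
}

/* Compare the CSS regenerated from the concrete document (old_props)
   with the CSS found in the edited HTML (new_props) for one property. */
static PropChange prop_change(const Prop *old_props, size_t old_n,
                              const Prop *new_props, size_t new_n,
                              const char *name)
{
    const char *before = prop_get(old_props, old_n, name);
    const char *after = prop_get(new_props, new_n, name);
    if (after == NULL)
        return before == NULL ? PROP_SAME : PROP_REMOVED;
    if (before != NULL && strcmp(before, after) == 0)
        return PROP_SAME;
    return PROP_SET;   /* newly added, or value changed */
}
```

Only properties reported as PROP_SET or PROP_REMOVED would then need to be written back to the formatting elements, which leaves everything else in the concrete document alone.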

During traversal, the functions in DFBDT.c can be used to handle the case where the children of a given element have been re-ordered (e.g. someone moved a paragraph to a different position in the document). This uses the id mappings in the HTML to figure out what elements in the concrete document they correspond to, and when it sees them in a different order, it moves some of them so that they come to match the order in which the corresponding HTML elements appear. Unsupported elements are left untouched by this process.
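
As a simplified stand-in for this reordering (it is not the algorithm in DFBDT.c, which I have not reproduced here), the sketch below rearranges the supported children of a concrete element to follow the order seen in the HTML, while entries with no HTML counterpart stay in their slots. The Child type, the negative-id convention for unsupported elements, and the function names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CHILDREN 64

/* Hypothetical child record: id < 0 means "no HTML counterpart". */
typedef struct { int id; const char *label; } Child;

static const Child *find_child(const Child *items, size_t n, int id)
{
    for (size_t i = 0; i < n; i++)
        if (items[i].id == id)
            return &items[i];
    return NULL;
}

/* Rewrite `children` so that supported entries appear in the order given
   by `order` (assumed to be a permutation of the supported ids).
   Slots holding unsupported entries are not touched.
   Returns 0 on success, -1 on a mismatch; on failure nothing is changed. */
static int reorder_children(Child *children, size_t n,
                            const int *order, size_t order_n)
{
    assert(children != NULL && order != NULL);
    if (n > MAX_CHILDREN)
        return -1;
    Child result[MAX_CHILDREN];
    memcpy(result, children, n * sizeof *children);
    size_t next = 0;
    for (size_t i = 0; i < n; i++) {
        if (children[i].id < 0)
            continue;                       /* unsupported: keep in place */
        if (next >= order_n || order[next] < 0)
            return -1;
        const Child *src = find_child(children, n, order[next++]);
        if (src == NULL)
            return -1;
        result[i] = *src;
    }
    if (next != order_n)
        return -1;
    memcpy(children, result, n * sizeof *children);
    return 0;
}
```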

The create operation will need to produce a brand new ODT file based on a HTML file. This can simply be implemented by creating an empty ODT file, and then doing a put operation - it’s essentially “updating” an empty document to which new content has been added.

The entry points for these three functions are DFGet, DFPut, and DFCreate in api/src/Operations.c. These each have a switch statement which looks at the file type and calls through to a function in the appropriate filter to do the conversion. In the future we may need a more generic/pluggable way of doing this, but for the time being, defining three functions ODTGet, ODTPut, and ODTCreate (corresponding to the existing WordGet, WordPut, and WordCreate functions) and adding cases to the switch statements for these will be sufficient.
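
A rough sketch of that dispatch, including create implemented as a put on an empty document, might look like the following. The function signatures, the FileType enum, the extension check, and the dispatch_* names are all assumptions made for illustration; the real DFPut/DFCreate and WordPut signatures in the codebase will differ, and the filter functions here are stubs that only count calls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical file-type tag derived from the file extension. */
typedef enum { TYPE_UNKNOWN, TYPE_DOCX, TYPE_ODT } FileType;

static FileType file_type_from_path(const char *path)
{
    assert(path != NULL);
    const char *dot = strrchr(path, '.');
    if (dot == NULL) return TYPE_UNKNOWN;
    if (strcmp(dot, ".docx") == 0) return TYPE_DOCX;
    if (strcmp(dot, ".odt") == 0) return TYPE_ODT;
    return TYPE_UNKNOWN;
}

/* Stub filter entry points; the counters make the routing observable. */
static int word_put_calls = 0;
static int odt_put_calls = 0;

static int WordPut(const char *concrete, const char *html)
{
    (void)concrete; (void)html;
    word_put_calls++;
    return 1;
}

static int ODTPut(const char *concrete, const char *html)
{
    (void)concrete; (void)html;
    odt_put_calls++;
    return 1;
}

static int ODTCreate(const char *concrete, const char *html)
{
    /* A real version would first write an empty ODT package to `concrete`,
       then update that empty document from the HTML. */
    return ODTPut(concrete, html);
}

/* Route a put request to the filter for the concrete file's type. */
static int dispatch_put(const char *concrete, const char *html)
{
    switch (file_type_from_path(concrete)) {
        case TYPE_DOCX: return WordPut(concrete, html);
        case TYPE_ODT:  return ODTPut(concrete, html);
        default:        return 0;   /* unsupported format */
    }
}

/* Route a create request; only the ODT case is sketched here. */
static int dispatch_create(const char *concrete, const char *html)
{
    switch (file_type_from_path(concrete)) {
        case TYPE_ODT: return ODTCreate(concrete, html);
        default:       return 0;
    }
}
```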

It’s probably best to start off by having a look at these functions in filters/ooxml/src/word/Word.c and following the code through there. If you’re using Xcode, you can easily jump through the function call graph to go to the implementation of a called function; I expect Visual Studio has something similar. At any rate, I’ve mostly chosen function names that are not prefixes of other function names, so it should be fairly easy to find the function you’re looking for with grep if you don’t know what file it’s in (this is something I love about C, which you can’t do so easily using object-oriented languages).

The Word filter has two core classes used during conversion - WordPackage and WordConverter (defined in their respective .h and .c files). A Word package encapsulates a .docx file, and contains data structures loaded from the XML files stored within the .docx package (which is actually a zip file). There are classes for things like the stylesheet, numbering information, the set of footnotes/endnotes, and so forth. For ODF, I already did a little bit of work a while back defining skeleton versions of the corresponding classes (ODFPackage, ODFManifest, and ODFSheet). The file ODF.c is empty but would be a suitable place to put the get/put/create functions.

Data structures used in ODF differ somewhat from those of Word documents, though there is a lot of conceptual similarity. The most significant difference I can think of is the way that direct formatting is handled - ODF treats *everything* as a style; if you apply direct formatting to a run of text, then it creates what’s called an “automatic style” and references that from the content. So styles, formatting, numbering, and numerous other things will have to be represented differently, but many of the strategies used in the Word filter should carry across fairly easily. I need to document these better, but perhaps it’s easiest if you ask me questions when you get stuck, and then we can put the answers on the wiki or in the source documentation.
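
To illustrate what “automatic styles” imply for a put operation, here is a minimal sketch of a table that hands out one generated style name per distinct set of direct formatting, so that content can reference the style by name. The table type, the function name, and the "T1", "T2" naming are assumptions made for this sketch, not part of the existing ODF skeleton classes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_AUTO_STYLES 32

/* Hypothetical record: a generated name plus the formatting it stands for.
   `props` is borrowed; the caller must keep the string alive. */
typedef struct { char name[8]; const char *props; } AutoStyle;

typedef struct {
    AutoStyle styles[MAX_AUTO_STYLES];
    size_t count;
} AutoStyleTable;

/* Return the name of the automatic style for `props`, creating one if this
   formatting has not been seen before. Returns NULL if the table is full. */
static const char *auto_style_for(AutoStyleTable *table, const char *props)
{
    assert(table != NULL && props != NULL);
    for (size_t i = 0; i < table->count; i++)
        if (strcmp(table->styles[i].props, props) == 0)
            return table->styles[i].name;      /* reuse existing style */
    if (table->count == MAX_AUTO_STYLES)
        return NULL;
    AutoStyle *s = &table->styles[table->count];
    snprintf(s->name, sizeof s->name, "T%zu", table->count + 1);
    s->props = props;
    table->count++;
    return s->name;
}
```

Two runs with identical direct formatting end up sharing one style, whereas in a Word document each run would carry its own copy of the formatting properties.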

Anyway, this is just a braindump of what I think are the most relevant things someone implementing an ODF filter will need to know. I’d love to be pestered with more questions about this, as I think getting started on this important task would be a good step forward for the project, and demonstrate our commitment to making interoperability easier for people.

—
Dr Peter M. Kelly
pmkelly@apache.org

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)

Re: ODF filter

Posted by Peter Kelly <pm...@apache.org>.

> On 8 Jan 2015, at 10:59 pm, Peter Kelly <pm...@apache.org> wrote:
> 
>> On 8 Jan 2015, at 10:16 am, Dave Fisher <da...@comcast.net> wrote:
>> 
>> Hi Peter,
>> 
>> This is a helpful email; from your concrete discussion I can better understand the mapping between the abstract / HTML model and the concrete / DOCX, ODT.
>> 
>> You mention differences in the style runs for Word and ODT, with which I am familiar from the OOXML side. Does the abstract model / HTML take a particular approach towards style runs? Is there a concrete version of the HTML model? Is there a specification or plan for the abstract model?
> 
> As a general principle, no - a given filter is expected to handle arbitrary HTML.
> 
> However, there is a function for “normalising” a HTML document to change nested sets of inline elements (span, b, i, etc.) into a flat sequence of runs (each represented as a span element). The Word filter uses this, due to Word’s flat model of inline runs.

Just thought I’d add a bit more detail on this, for anyone interested in exploring the implementation:

For .docx files, DFPut (api/src/Operations.c) calls WordPut (filters/ooxml/src/word/Word.c), which in turn creates a WordPackage object and then calls WordPackageUpdateFromHTML (filters/ooxml/src/word/WordPackage.c). The very first thing this does is to call HTML_normalizeDocument and HTML_pushDownInlineProperties (both in core/src/html/DFHTMLNormalization.c).

HTML_normalizeDocument merges adjacent text nodes (which in theory shouldn’t be necessary, but I found that sometimes libxml’s parser produces two or more in a row), and then goes through all the block-level elements, flattening any inline elements such that the resulting block node contains a series of spans, each with a style attribute set to the appropriate CSS formatting properties. For example, if you start with this:

<p>
    Here
    <b>
        is
        <i>
            some
        </i>
        text
    </b>
</p>

then you’ll end up with this:

<p>
    <span>Here</span>
    <span style="font-weight: bold">
        is
    </span>
    <span style="font-weight: bold; font-style: italic">
        some
    </span>
    <span style="font-weight: bold">
        text
    </span>
</p>
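
The flattening step can be sketched as a recursive walk that carries the accumulated formatting down to each text node. This is not the code in DFHTMLNormalization.c; the Inline and Run types, the bit flags, and the function name are hypothetical, and only bold and italic are modelled.

```c
#include <assert.h>
#include <stddef.h>

enum { FMT_BOLD = 1, FMT_ITALIC = 2 };

/* Hypothetical inline node: either a text node (text != NULL) or an
   element such as <b> or <i> that adds formatting to its children. */
typedef struct Inline {
    const char *text;
    unsigned add;
    struct Inline *first_child;
    struct Inline *next_sibling;
} Inline;

/* One flat run: a piece of text with its complete formatting. */
typedef struct { const char *text; unsigned fmt; } Run;

/* Walk `node` and its following siblings, appending one run per text node.
   Returns the total number of runs found, which may exceed `cap`;
   runs beyond `cap` are counted but not stored. */
static size_t flatten(const Inline *node, unsigned inherited,
                      Run *out, size_t cap, size_t used)
{
    assert(out != NULL || cap == 0);
    for (; node != NULL; node = node->next_sibling) {
        if (node->text != NULL) {
            if (used < cap) {
                out[used].text = node->text;
                out[used].fmt = inherited;
            }
            used++;
        } else {
            used = flatten(node->first_child, inherited | node->add,
                           out, cap, used);
        }
    }
    return used;
}
```

Applied to the paragraph above, this would give four runs: "Here" with no formatting, "is" in bold, "some" in bold and italic, and "text" in bold, matching the four spans in the normalised output.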

HTML_pushDownInlineProperties checks block elements for any CSS properties that can be applied to inline formatting (such as font family, font size, text color) and moves them to the style attributes of the span elements within the block element. For example, the following:

<p style="border: 1px solid black; font-size: 18">
    <span>Some text</span>
</p>

would become this:

<p style="border: 1px solid black">
    <span style="font-size: 18">Some text</span>
</p>
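
A simplified sketch of the push-down step is shown below, assuming a block with a single span and properties held as name/value pairs. The Prop type, the list of inline-capable property names, and the function names are assumptions; the real HTML_pushDownInlineProperties may classify properties differently, and a fuller version would need to let a span's own value take precedence over one inherited from the block.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical representation of one CSS property. */
typedef struct { const char *name; const char *value; } Prop;

/* Assumed set of properties that belong on inline runs. */
static int is_inline_property(const char *name)
{
    static const char *const inline_names[] = {
        "font-family", "font-size", "font-weight", "font-style", "color"
    };
    assert(name != NULL);
    for (size_t i = 0; i < sizeof inline_names / sizeof inline_names[0]; i++)
        if (strcmp(name, inline_names[i]) == 0)
            return 1;
    return 0;
}

/* Move inline-capable properties from the block's list to the span's list.
   The block list is compacted in place. Properties that do not fit in the
   span list stay on the block. Returns the number of properties moved. */
static size_t push_down(Prop *block, size_t *block_n,
                        Prop *span, size_t *span_n, size_t span_cap)
{
    size_t kept = 0, moved = 0;
    for (size_t i = 0; i < *block_n; i++) {
        if (is_inline_property(block[i].name) && *span_n < span_cap) {
            span[(*span_n)++] = block[i];
            moved++;
        } else {
            block[kept++] = block[i];
        }
    }
    *block_n = kept;
    return moved;
}
```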

Both of these are pre-processing stages that happen before the primary traversal of the document tree begins, and the later code in the Word filter expects the HTML documents to conform to this more restrictive “dialect”. In the case of the inline properties, it’s because these settings have to go on the rPr elements in a Word document, and are not allowed on the pPr elements (that is, Word is more strict in terms of which formatting properties can be set where; HTML allows you to set “inline” formatting properties on any element using a style attribute). So this pre-processing is largely to match the needs of the Word filter, but it’s likely that an ODF text document filter will need some pre-processing as well.

As we add more formats, I expect we’ll discover some common places where the HTML input needs to be normalised to a certain form, and also places where it is better to leave it as-is. The ability to have nested inline elements in ODF is an example of the latter; we can probably avoid HTML_normalizeDocument in that case by having a direct relationship between HTML inline elements and ODF text-span elements. Depending on the situation, retaining such structure may be important - but that’s something I expect we’ll discover as we proceed with implementation.

—
Dr Peter M. Kelly
pmkelly@apache.org


Re: ODF filter

Posted by jan i <ja...@apache.org>.

On 8 January 2015 at 16:59, Peter Kelly <pm...@apache.org> wrote:

> > On 8 Jan 2015, at 10:16 am, Dave Fisher <da...@comcast.net> wrote:
> >
> > Hi Peter,
> >
> > This is a helpful email; from your concrete discussion I can better
> > understand the mapping between the abstract / HTML model and the
> > concrete / DOCX, ODT.
> >
> > You mention differences in the style runs for Word and ODT, with which I
> > am familiar from the OOXML side. Does the abstract model / HTML take a
> > particular approach towards style runs? Is there a concrete version of
> > the HTML model? Is there a specification or plan for the abstract model?
>
> As a general principle, no - a given filter is expected to handle
> arbitrary HTML.
>
> However, there is a function for “normalising” a HTML document to change
> nested sets of inline elements (span, b, i, etc.) into a flat sequence of
> runs (each represented as a span element). The Word filter uses this, due
> to Word’s flat model of inline runs.
>
> ODF text documents, on the other hand, *do* support nested formatting
> runs, so when writing this filter it may make sense not to apply the
> normalisation process used in the word filter. This should be done if there
> is information that could not be represented in HTML and would be lost by
> flattening the structure like we do for word.
>
> There’s been a few times where the topic of what internal representation
> we should use has been raised - whether we should stick with HTML, come up
> with our own entirely different model, or something else. I personally
> think HTML is a good choice, but perhaps for those who have raised the
> issue of an alternate intermediate form, this might be a good time to start
> that discussion ;)
>

Point taken; I am, I assume, the first who questioned it. But just to be
precise, I am happy having HTML as the internal structure, but I am unhappy
that filters can do what they like with the HTML. My goal is to define a
set of access functions that filters should use to navigate/insert/delete
tags, and restrictions on what can be put in the tags. Just imagine one
filter needs to id some tags, and therefore uses id=, while another filter
needs to name some tags, and therefore uses name=. If we are not careful
here it will explode, and reading HTML becomes nearly as complicated as
reading the formats directly. We should have 1 and only 1 HTML definition,
which the filters can use.

rgds
jan I.


Re: ODF filter

Posted by Peter Kelly <pm...@apache.org>.

> On 8 Jan 2015, at 10:16 am, Dave Fisher <da...@comcast.net> wrote:
> 
> Hi Peter,
> 
> This is a helpful email; from your concrete discussion I can better understand the mapping between the abstract / HTML model and the concrete / DOCX, ODT.
> 
> You mention differences in the style runs for Word and ODT, with which I am familiar from the OOXML side. Does the abstract model / HTML take a particular approach towards style runs? Is there a concrete version of the HTML model? Is there a specification or plan for the abstract model?

As a general principle, no - a given filter is expected to handle arbitrary HTML.

However, there is a function for “normalising” a HTML document to change nested sets of inline elements (span, b, i, etc.) into a flat sequence of runs (each represented as a span element). The Word filter uses this, due to Word’s flat model of inline runs.

ODF text documents, on the other hand, *do* support nested formatting runs, so when writing this filter it may make sense not to apply the normalisation process used in the Word filter. Skipping normalisation would be the right choice if there is information that could not be represented in HTML and would be lost by flattening the structure like we do for Word.

There’s been a few times where the topic of what internal representation we should use has been raised - whether we should stick with HTML, come up with our own entirely different model, or something else. I personally think HTML is a good choice, but perhaps for those who have raised the issue of an alternate intermediate form, this might be a good time to start that discussion ;)

—
Dr Peter M. Kelly
pmkelly@apache.org


Re: ODF filter

Posted by Dave Fisher <da...@comcast.net>.

Hi Peter,

This is a helpful email; from your concrete discussion I can better understand the mapping between the abstract / HTML model and the concrete / DOCX, ODT.

You mention differences in the style runs for Word and ODT, with which I am familiar from the OOXML side. Does the abstract model / HTML take a particular approach towards style runs? Is there a concrete version of the HTML model? Is there a specification or plan for the abstract model?

I also think that one interesting approach to filters for other file formats would be to focus on PUT functionality before GET. Understanding how to write a proper document is the first step towards reading documents in all of their historical variations. PDF is a classic example of this. Adobe has always done well defining what a valid PDF document looks like, but after 24 years there are myriad variants that are valid.

Regards,
Dave
