Posted to dev@corinthia.apache.org by "Dennis E. Hamilton" <de...@acm.org> on 2015/01/08 17:40:39 UTC

Corinthia Document Model (was RE: ODF filter)

 -- reply below to --
From: jan i [mailto:jani@apache.org] 
Sent: Thursday, January 8, 2015 08:12
To: dev@corinthia.incubator.apache.org
Subject: Re: ODF filter

On 8 January 2015 at 16:59, Peter Kelly <pm...@apache.org> wrote:

[ ... ]

> As a general principle, no - a given filter is expected to handle
> arbitrary HTML.
>
> However, there is a function for “normalising” an HTML document to change
> nested sets of inline elements (span, b, i, etc.) into a flat sequence of
> runs (each represented as a span element). The Word filter uses this, due
> to Word’s flat model of inline runs.
>
> ODF text documents, on the other hand, *do* support nested formatting
> runs, so when writing this filter it may make sense not to apply the
> normalisation process used in the Word filter - specifically, when the
> document contains information that would be lost by flattening the
> structure the way we do for Word.
>
> There have been a few times when the topic of what internal representation
> we should use has been raised - whether we should stick with HTML, come up
> with our own entirely different model, or something else. I personally
> think HTML is a good choice, but perhaps for those who have raised the
> issue of an alternate intermediate form, this might be a good time to start
> that discussion ;)
>

Point taken; I assume I am the first who questioned it. But just to be
precise: I am happy having HTML as the internal structure, but I am
unhappy that filters can do what they like with the HTML. My goal is to
define a set of access functions that filters should use to
navigate/insert/delete tags, together with restrictions on what can be
put in the tags. Just imagine one filter needs to identify some tags and
therefore uses id=, while another filter needs to name some tags and
therefore uses name=. If we are not careful here this will explode, and
reading our HTML becomes nearly as complicated as reading the formats
directly. We should have one and only one HTML definition, which all the
filters use.
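
For instance, after two such filters have touched the same document you
could end up with markup like this (a made-up example, not real filter
output):

   <p id="word-para-12">first paragraph</p>
   <p name="odf:P7" style="janPrivate">second paragraph</p>

and every other filter now has to understand both conventions just to
read the document back.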

rgds
jan I.

<orcmid>
  I'm not following this well.  
  Let me ask it this way: Are we talking about fixing some sort of DOM over
  the HTML5 or are we allowing arbitrary HTML5 and transforming to and from
  it? 
 
  I am having trouble visualizing this process -- is the intermediate
  concrete HTML and not some DOM view?

  This relates to how inter-conversion is to be tested.  Is there some 
  abstraction against which document features are assessed and mapped
  through or are we working concrete level to/from concrete level and
  that is essentially it?

  Help me calibrate my understanding of the thrust.
</orcmid>



Re: Corinthia Document Model (was RE: ODF filter)

Posted by Peter Kelly <pm...@apache.org>.
> On 9 Jan 2015, at 12:02 am, jan i <ja...@apache.org> wrote:
> 
> Without polluting with all the function calls, let me try to explain how
> I see the current source (peter@ please correct me if I am wrong).
> 
> A filter can in principle inject any HTML5 string into the data model.
> Core delivers functions to manipulate the HTML5 model, but does not
> control what happens.
> 
> Meaning that if a filter wants to write "<p style=janPrivate,
> idJan=nogo>foo</p>" to the data, it can do that. The problem is that all
> the other filters then need to understand this when reading the data and
> generating their format.

Just to clarify the representation - it's a DOM-like model, in that we have a tree data structure with nodes (elements and text nodes), where elements can have attributes. It's very similar to the W3C DOM, but some of the function names and field names are different, and it doesn't use inheritance (due to C being the implementation language). There is no string concatenation going on during conversion - the DOM tree is parsed from, and serialised to, XML or HTML in the standard fashion.
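
To make that concrete, a minimal sketch of such a node structure in C might look like the following (all type and field names here are invented for illustration, not the actual ones used in the code):

    typedef enum { NODE_ELEMENT, NODE_TEXT } NodeKind;

    typedef struct Attribute {
        char *name;
        char *value;
        struct Attribute *next;      /* simple linked list of attributes */
    } Attribute;

    typedef struct Node {
        NodeKind kind;               /* element or text node */
        char *tag;                   /* element name; NULL for text nodes */
        char *content;               /* character data; NULL for elements */
        Attribute *attributes;       /* only used by element nodes */
        struct Node *parent;
        struct Node *firstChild;     /* children form a list threaded */
        struct Node *nextSibling;    /* through the nextSibling pointers */
    } Node;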

> 
> My idea is that core should provide functions like (just an example)
>   addParagraph(*style, *id, *text)
> Doing that means a filter cannot write arbitrary HTML5, only what is
> "allowed". If a filter needs a new capability, core would be extended in
> a controlled fashion and all filters updated.

One approach - admittedly radical (but don't let that stop us) - is to enforce this at the level of the type system, based on the HTML DTD, as well as possibly the XML schema definitions for the individual file formats. Unfortunately, however, C's type system isn't really powerful enough to express the sort of constraints we'd want to enforce; Haskell is the only language I know of whose type system is.

The parsing toolkit I'm working on (based on PEG - see http://bford.info/packrat/) takes a grammar as input and produces a syntax tree (currently in a custom data structure, though it could easily produce the tree in XML or similar). I'm interested in taking this idea further, making the grammar and the type system one and the same, and using this to define a high-level functional language in which transformations could be expressed. Union types are really important here - Haskell supports them well, but few other languages do - though the concept has been alive and well in formal grammars since the beginning: a production with multiple different possible ways of matching is essentially a union type.
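
For anyone unfamiliar with PEGs: their defining feature is ordered choice - the alternatives of a production are tried strictly in order and the first match wins, which makes the grammar unambiguous by construction. A toy recogniser in C shows the idea (purely illustrative, not the toolkit's API):

    #include <string.h>

    /* Try to match a literal at *pos: on success advance *pos and
       return 1; on failure leave *pos untouched and return 0. */
    static int lit(const char *s, size_t *pos, const char *l)
    {
        size_t n = strlen(l);
        if (strncmp(s + *pos, l, n) == 0) { *pos += n; return 1; }
        return 0;
    }

    /* PEG rule: Emph <- "**" / "__"
       Ordered choice is just short-circuit ||. A packrat parser
       additionally memoises (rule, position) results, keeping the
       parse linear-time despite backtracking. */
    static int emph(const char *s, size_t *pos)
    {
        return lit(s, pos, "**") || lit(s, pos, "__");
    }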

I've worked a lot with Stratego/XT (http://strategoxt.org) in the past and have been inspired by its unique approach to expressing language transformations. I think something like this would be very well suited to what we want to do. My main problem with Stratego, however, is that it's untyped; you can't enforce the restriction that a particular transformation results in a particular type/structure, nor can you specify the types of structure it accepts. I think a language that merges the concepts of Stratego's transformation strategies, Haskell's type system, and PEG-based formal grammars would be a very powerful and elegant way to achieve our goals.

My primary motivation for using formal grammars is to give us the ability to handle non-XML based languages, such as Markdown, RTF, LaTeX etc. With a suitable parser implementation, we can deal with these just as easily as we can with any other XML-based structure - and in fact we could even move to a higher level of abstraction where XML is just a special case of the more general type system. XML Schema and Relax NG (used for the OOXML and ODF specs respectively, if I remember correctly) could also be used as inputs to the type system, and used for static typing.

A programming language of this nature would allow us to formally specify the exact nature of the intermediate form (be it a dialect of HTML or otherwise), and get static type checking of the transformation code to a degree that can't be achieved with C/C++ or other similar languages. More static type checking also has the potential to reduce the number of required testcases, as we can eliminate whole classes of errors through the type system.

>>  This relates to how inter-conversion is to be tested.  Is there some
>>  abstraction against which document features are assessed and mapped
>>  through or are we working concrete level to/from concrete level and
>>  that is essentially it?
>> 
> I don't think we should test inter-conversion as such. It is much more
> efficient to test format xyz <-> HTML5. And if our usage of HTML5 is
> defined (and restricted), it should just work.

Agreed. Think of it like the frontend and backend parts of a compiler. If you want to support N languages on M CPU architectures, then you would generally have a CPU-independent intermediate representation (essentially a high-level assembly language). You write a frontend for each of the N languages which targets this intermediate, abstract machine (including language-specific optimisations). You also write a backend for each of the M target CPU architectures (including architecture-specific optimisations). You then need N+M tests, instead of N*M.

In our case, HTML is the "intermediate architecture", or more appropriately, "intermediate format". Each filter knows about its own format (e.g. .docx) and HTML, and deals solely with the conversion between those two.

If you want to convert from, say, .docx to .odt, then you go through HTML as an intermediate step: the file gets converted from .docx to HTML, and then from HTML to .odt.
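
The dispatch is then trivial once every filter exposes just those two directions; a sketch in C, with invented names rather than the actual filter interface:

    typedef struct Document Document;   /* the in-memory HTML tree */

    typedef struct Filter {
        const char *extension;                    /* "docx", "odt", ... */
        Document *(*toHTML)(const char *inPath);  /* native -> HTML */
        int (*fromHTML)(Document *doc, const char *outPath);
    } Filter;

    /* .docx -> .odt is just .docx -> HTML followed by HTML -> .odt.
       N formats therefore need 2*N conversion routines, not N*(N-1). */
    int convert(Filter *src, Filter *dst, const char *in, const char *out)
    {
        Document *doc = src->toHTML(in);
        if (doc == NULL)
            return -1;
        return dst->fromHTML(doc, out);
    }

Adding a new format then means writing one new filter, without touching any of the existing ones.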

—
Dr Peter M. Kelly
pmkelly@apache.org

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)


Re: Corinthia Document Model (was RE: ODF filter)

Posted by jan i <ja...@apache.org>.
On 8 January 2015 at 17:40, Dennis E. Hamilton <de...@acm.org>
wrote:

>  -- reply below to --
> From: jan i [mailto:jani@apache.org]
> Sent: Thursday, January 8, 2015 08:12
> To: dev@corinthia.incubator.apache.org
> Subject: Re: ODF filter
>
> [ ... ]
>
> <orcmid>
>   I'm not following this well.
>   Let me ask it this way: Are we talking about fixing some sort of DOM over
>   the HTML5 or are we allowing arbitrary HTML5 and transforming to and from
>   it?
>
>   I am having trouble visualizing this process -- is the intermediate
>   concrete HTML and not some DOM view?
>

You are not the only one - it took me quite some evenings to get even a
little way into the code.

Without polluting with all the function calls, let me try to explain how
I see the current source (peter@ please correct me if I am wrong).

A filter can in principle inject any HTML5 string into the data model.
Core delivers functions to manipulate the HTML5 model, but does not
control what happens.

Meaning that if a filter wants to write "<p style=janPrivate,
idJan=nogo>foo</p>" to the data, it can do that. The problem is that all
the other filters then need to understand this when reading the data and
generating their format.

My idea is that core should provide functions like (just an example)
   addParagraph(*style, *id, *text)
Doing that means a filter cannot write arbitrary HTML5, only what is
"allowed". If a filter needs a new capability, core would be extended in
a controlled fashion and all filters updated.
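
A sketch of what such a guarded API could look like in C (addParagraph is
my example above; the Node type and the helper functions are invented here
just to show the shape of the idea):

   typedef struct Node Node;   /* core's element/text tree node */

   /* helpers assumed to exist inside core (names invented): */
   int   coreKnowsParagraphStyle(const char *style);
   Node *coreCreateElement(Node *parent, const char *tag);
   void  coreSetAttribute(Node *e, const char *name, const char *value);
   void  coreAppendText(Node *e, const char *text);

   /* Core, not the filter, decides what ends up in the tree: unknown
      styles are rejected, and only whitelisted attributes appear. */
   Node *addParagraph(Node *parent, const char *style,
                      const char *id, const char *text)
   {
       if (!coreKnowsParagraphStyle(style))   /* no "janPrivate" here */
           return NULL;
       Node *p = coreCreateElement(parent, "p");
       coreSetAttribute(p, "class", style);
       if (id != NULL)
           coreSetAttribute(p, "id", id);
       coreAppendText(p, text);
       return p;
   }

That way core can evolve the "allowed" set in one place, and a filter that
needs something new has to ask for it instead of inventing its own
attributes.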


>   This relates to how inter-conversion is to be tested.  Is there some
>   abstraction against which document features are assessed and mapped
>   through or are we working concrete level to/from concrete level and
>   that is essentially it?
>
I don't think we should test inter-conversion as such. It is much more
efficient to test format xyz <-> HTML5. And if our usage of HTML5 is
defined (and restricted), it should just work.



>
>   Help me calibrate my understanding of the thrust.
>
Hope it helps a bit... if not, please ask again, because this is a really
crucial point we all need to agree on.

rgds
jan i

> </orcmid>