Posted to dev@corinthia.apache.org by Gabriela Gibson <ga...@gmail.com> on 2015/05/23 01:36:20 UTC

ODF branch: The confused edition.

Hi,

Well, I managed to get (rudimentary) headers, tables, lists working, but
the bold, italic and underlined nodes have me confused, mostly because
nothing appears in the order I expect it to.  I used the
file sample/documents/odf/bold-italic-underlined.odt for that part.

I also had to do some surgery on DocFormats/core/src/xml/DFNameMap.* so I
could access DFNameMap.

Please see the logmessage for the (gory) details.

https://github.com/apache/incubator-corinthia/commit/295caa80be4cfc37e0e7bdd4aa3fcc6ad7b1a1b6

G

-- 
Visit my Coding Diary: http://gabriela-gibson.blogspot.com/

Re: ODF branch: The confused edition.

Posted by Ian C <ia...@amham.net>.
Hi Gabriela,

I have merged your branch into mine via git.

I can see the changes.

I do not know how to run them or exactly what you are attempting.

The test documents you have created, are you trying to get them into the DF
internal format and then spit them out as HTML? Or did I miss something and
HTML is the intermediate format?

I have been trying to see how the code grabs the document and then what it
does with it. Is it the case that you are at the stage of trying to read
the input structure?

One of the things that is going to make processing ODF documents harder is the
automatic styles.
This document adds a couple, and the bold text in one of the paragraphs
is in a span that uses one of the automatic styles.

I am thinking about how this could/should be juggled. As Peter says, creating an
in-memory styles database first is probably the way to start.
For early iterations I think we should create ODF documents that do not use
automatic styles. It is not always easy to see if that is the case without
examining the content.xml though. But hey I have a way of doing that...I'll
email later.

If I can figure out how the code base works I will have a go at trying to
help you.
Any pointers gratefully accepted :-)

On Sun, May 24, 2015 at 2:00 AM, Peter Kelly <pm...@apache.org> wrote:

> > On 23 May 2015, at 6:36 am, Gabriela Gibson <ga...@gmail.com>
> wrote:
> >
> > Hi,
> >
> > Well, I managed to get (rudimentary) headers, tables, lists working, but
> > the bold, italic and underlined nodes have me confused, mostly because
> > nothing appears in the order I expect it to.  I used the
> > file sample/documents/odf/bold-italic-underlined.odt for that part.
>
> I haven’t yet gone and run the code, but looking at the approaches used in
> traverseContent() where it calls find_HTML() to determine the corresponding
> HTML tag for a given ODF node, I don’t think this is going to work for a
> lot of the constructs in ODF. The reason is that it’s not always a simple
> mapping - for some basic constructs like paragraphs and (some) tables it
> will work, but there will be other cases where more complex processing is
> needed. So I think, at least for the time being (the very interesting DSL
> ideas we’ve been discussing notwithstanding), a first cut that has one big
> switch statement for all the supported node types is more likely to be
> successful. This way, you can do any arbitrary processing you need for a
> given node type, and are not restricted to simply mapping it to a
> particular HTML element.
>
> In terms of the formatting nodes like those for italic and bold, I would
> suggest building up a set of CSS properties rather than creating
> HTML tags for <b>, <i>, and <u>. The reason for this is that there are only
> a few such tags in HTML, but there are many other formatting properties
> that can’t be expressed in this manner and instead use CSS. A <span>
> element with a style=“font-weight: bold” attribute is equivalent to <b>, and
> there’s some code somewhere in the html directory which from memory I think
> converts between the two. So creating a CSSProperties object and setting
> the relevant name/value pairs in that will enable you to serialise the
> result and place that in a span tag.
>
> The other reason the CSS approach is more appropriate is that it can also
> be used for stylesheets. For automatic styles in ODF, we want to translate
> those to style=“…” attributes in HTML (that is, direct formatting, which is
> essentially what automatic styles are). However for normal styles, we want
> an entry in the CSS stylesheet, and then reference that from the HTML
> element via the class=“…” attribute.
>
> Have a look in the ooxml/src/word/formatting directory for how this is
> handled in the Word filter. This takes an XML node from the Word document
> as input, and populates a CSSProperties object with the appropriate values.
> There are also functions to go the other way, when performing an update. I
> would recommend an approach similar to this.
>
> Coming back to HTML_B and friends: I just had a look at
> HTMLNormalization.c and it looks like it only does this in the inverse
> situation to what I described above. That is, when reading an HTML file and
> preparing it for conversion into a Word document, it converts <b>, <i>, <u>
> etc into <span> tags with the appropriate CSS properties set in the style
> attribute. It doesn’t go the other way, though that could potentially be
> done. Both approaches are essentially identical anyway in terms of how they
> will render in a browser and be treated by the editor.
>
> >
> > I also had to do some surgery on DocFormats/core/src/xml/DFNameMap.* so I
> > could access DFNameMap.
>
> It isn’t actually necessary to put this stuff in the header - it’s best to
> keep the struct definition in the C file and only ever access it through
> the functions exposed in the header. If you’re not accessing any of the
> fields of DFNameMap (which you’re not, at least in the code currently in
> the repository), then the compiler simply needs to know that there exists a
> struct type called DFNameMap, without knowing what its fields actually
> are. The following line in DFNameMap.h declares the typedef:
>
> typedef struct DFNameMap DFNameMap;
>
> Everything you need to do with the name map can be achieved with the
> public functions - and in the event you find something that can’t be done,
> it’s better to add a new function than to expose the struct. Though this shouldn’t be
> necessary; if you find such situations let me know and I’ll explain how to
> do it with the existing functions :)
>
> —
> Dr Peter M. Kelly
> pmkelly@apache.org
>
> PGP key: http://www.kellypmk.net/pgp-key <http://www.kellypmk.net/pgp-key>
> (fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
>
>


-- 
Cheers,

Ian C

Re: ODF branch: The confused edition.

Posted by Peter Kelly <pm...@apache.org>.
> On 23 May 2015, at 6:36 am, Gabriela Gibson <ga...@gmail.com> wrote:
> 
> Hi,
> 
> Well, I managed to get (rudimentary) headers, tables, lists working, but
> the bold, italic and underlined nodes have me confused, mostly because
> nothing appears in the order I expect it to.  I used the
> file sample/documents/odf/bold-italic-underlined.odt for that part.

I haven’t yet gone and run the code, but looking at the approaches used in traverseContent() where it calls find_HTML() to determine the corresponding HTML tag for a given ODF node, I don’t think this is going to work for a lot of the constructs in ODF. The reason is that it’s not always a simple mapping - for some basic constructs like paragraphs and (some) tables it will work, but there will be other cases where more complex processing is needed. So I think, at least for the time being (the very interesting DSL ideas we’ve been discussing notwithstanding), a first cut that has one big switch statement for all the supported node types is more likely to be successful. This way, you can do any arbitrary processing you need for a given node type, and are not restricted to simply mapping it to a particular HTML element.
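
A minimal sketch of the switch-based dispatch described above, to show why a fixed tag-to-tag lookup falls short. The OdfKind enum, the OdfNode struct and htmlNameForNode() are hypothetical names made up for this illustration and are not part of the Corinthia code base; real code would switch on the tag constants the project already defines. The heading case is the interesting one: the HTML element depends on the node's text:outline-level attribute, so it cannot be a one-to-one mapping.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical node kinds for illustration only. */
typedef enum {
    ODF_TEXT_P,
    ODF_TEXT_H,
    ODF_TEXT_SPAN,
    ODF_TABLE_TABLE,
    ODF_UNKNOWN
} OdfKind;

/* Hypothetical, heavily simplified view of an ODF node. */
typedef struct {
    OdfKind kind;
    int outlineLevel;   /* text:outline-level, only meaningful for ODF_TEXT_H */
} OdfNode;

/* Writes the HTML element name for `node` into `out`.
   Returns 1 if the node type is handled, 0 otherwise. */
int htmlNameForNode(const OdfNode *node, char *out, size_t outSize)
{
    assert(node != NULL && out != NULL && outSize >= 8);
    switch (node->kind) {
        case ODF_TEXT_P:
            snprintf(out, outSize, "p");
            return 1;
        case ODF_TEXT_H: {
            /* Not a fixed mapping: the element depends on an attribute. */
            int level = node->outlineLevel;
            if (level < 1) level = 1;
            if (level > 6) level = 6;   /* HTML only has h1..h6 */
            snprintf(out, outSize, "h%d", level);
            return 1;
        }
        case ODF_TEXT_SPAN:
            snprintf(out, outSize, "span");
            return 1;
        case ODF_TABLE_TABLE:
            snprintf(out, outSize, "table");
            return 1;
        default:
            out[0] = '\0';   /* unsupported node type */
            return 0;
    }
}
```

Each case is free to do more than pick a name (inspect attributes, create several elements, skip children), which is the flexibility a plain lookup table does not give.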

In terms of the formatting nodes like those for italic and bold, I would suggest building up a set of CSS properties rather than creating HTML tags for <b>, <i>, and <u>. The reason for this is that there are only a few such tags in HTML, but there are many other formatting properties that can’t be expressed in this manner and instead use CSS. A <span> element with a style=“font-weight: bold” attribute is equivalent to <b>, and there’s some code somewhere in the html directory which from memory I think converts between the two. So creating a CSSProperties object and setting the relevant name/value pairs in that will enable you to serialise the result and place that in a span tag.
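
A rough sketch of the idea of collecting CSS name/value pairs and serialising them into a style attribute value. StyleProps, stylePut(), styleAddOdfAttr() and styleSerialize() are hypothetical stand-ins written for this example; they are not the project's CSSProperties API, whose actual functions should be used in real code. The three ODF attribute names (fo:font-weight, fo:font-style, style:text-underline-style) are the standard text-property attributes; everything else is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_PROPS 8

/* Hypothetical stand-in for a CSS property set. */
typedef struct {
    const char *names[MAX_PROPS];
    const char *values[MAX_PROPS];
    int count;
} StyleProps;

/* Appends one name/value pair; the strings are borrowed, not copied. */
void stylePut(StyleProps *p, const char *name, const char *value)
{
    assert(p != NULL && p->count < MAX_PROPS);
    p->names[p->count] = name;
    p->values[p->count] = value;
    p->count++;
}

/* Translates one ODF text-property attribute into a CSS property.
   Attributes this sketch does not know about are silently ignored. */
void styleAddOdfAttr(StyleProps *p, const char *attr, const char *value)
{
    if (!strcmp(attr, "fo:font-weight") && !strcmp(value, "bold"))
        stylePut(p, "font-weight", "bold");
    else if (!strcmp(attr, "fo:font-style") && !strcmp(value, "italic"))
        stylePut(p, "font-style", "italic");
    else if (!strcmp(attr, "style:text-underline-style") && strcmp(value, "none") != 0)
        stylePut(p, "text-decoration", "underline");
}

/* Serialises the set as "name: value; name: value".
   Stops early if the output buffer is too small. */
void styleSerialize(const StyleProps *p, char *out, size_t outSize)
{
    assert(p != NULL && out != NULL && outSize > 0);
    size_t used = 0;
    out[0] = '\0';
    for (int i = 0; i < p->count; i++) {
        int n = snprintf(out + used, outSize - used, "%s%s: %s",
                         i > 0 ? "; " : "", p->names[i], p->values[i]);
        if (n < 0 || (size_t)n >= outSize - used)
            break;
        used += (size_t)n;
    }
}
```

The serialised string is what would end up in the style attribute of a span, so any number of formatting properties can be carried by one element instead of nesting <b>, <i> and <u>.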

The other reason the CSS approach is more appropriate is that it can also be used for stylesheets. For automatic styles in ODF, we want to translate those to style=“…” attributes in HTML (that is, direct formatting, which is essentially what automatic styles are). However for normal styles, we want an entry in the CSS stylesheet, and then reference that from the HTML element via the class=“…” attribute.
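
To make the distinction concrete, here is a small sketch of the two output paths, assuming the CSS text for a style has already been produced. StyleEntry, htmlAttrForStyle() and cssRuleForStyle() are hypothetical names for this illustration only; the isAutomatic flag stands for "this style was found under office:automatic-styles". No escaping of names or values is attempted.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one ODF style after conversion to CSS. */
typedef struct {
    const char *name;    /* style name, e.g. "T1" or "Title" */
    int isAutomatic;     /* nonzero if declared under office:automatic-styles */
    const char *css;     /* already-serialised CSS properties */
} StyleEntry;

/* Writes the HTML attribute that should carry this style:
   direct formatting for automatic styles, a class reference otherwise. */
void htmlAttrForStyle(const StyleEntry *s, char *out, size_t outSize)
{
    assert(s != NULL && out != NULL && outSize > 0);
    if (s->isAutomatic)
        snprintf(out, outSize, "style=\"%s\"", s->css);
    else
        snprintf(out, outSize, "class=\"%s\"", s->name);
}

/* Writes the stylesheet rule for a normal style.
   Automatic styles get no rule, so the output is empty for them. */
void cssRuleForStyle(const StyleEntry *s, char *out, size_t outSize)
{
    assert(s != NULL && out != NULL && outSize > 0);
    if (s->isAutomatic)
        out[0] = '\0';
    else
        snprintf(out, outSize, ".%s { %s }", s->name, s->css);
}
```

With this split, the same CSS text serves both cases; only where it is written differs.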

Have a look in the ooxml/src/word/formatting directory for how this is handled in the Word filter. This takes an XML node from the Word document as input, and populates a CSSProperties object with the appropriate values. There are also functions to go the other way, when performing an update. I would recommend an approach similar to this.

Coming back to HTML_B and friends: I just had a look at HTMLNormalization.c and it looks like it only does this in the inverse situation to what I described above. That is, when reading an HTML file and preparing it for conversion into a Word document, it converts <b>, <i>, <u> etc into <span> tags with the appropriate CSS properties set in the style attribute. It doesn’t go the other way, though that could potentially be done. Both approaches are essentially identical anyway in terms of how they will render in a browser and be treated by the editor.

> 
> I also had to do some surgery on DocFormats/core/src/xml/DFNameMap.* so I
> could access DFNameMap.

It isn’t actually necessary to put this stuff in the header - it’s best to keep the struct definition in the C file and only ever access it through the functions exposed in the header. If you’re not accessing any of the fields of DFNameMap (which you’re not, at least in the code currently in the repository), then the compiler simply needs to know that there exists a struct type called DFNameMap, without knowing what its fields actually are. The following line in DFNameMap.h declares the typedef:

typedef struct DFNameMap DFNameMap;

Everything you need to do with the name map can be achieved with the public functions - and in the event you find something that can’t be done, it’s better to add a new function than to expose the struct. Though this shouldn’t be necessary; if you find such situations let me know and I’ll explain how to do it with the existing functions :)
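
For anyone unfamiliar with the pattern, this is a generic sketch of an opaque struct in C, collapsed into one file. NameTable and its four functions are hypothetical and deliberately do not mirror the real DFNameMap functions; the point is only that code seeing the "header" part can hold and pass a NameTable pointer without ever seeing the fields.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* ---- What would go in the header (hypothetical NameTable.h) ---- */
typedef struct NameTable NameTable;   /* incomplete type: fields not visible */
NameTable *NameTableNew(void);
void NameTableFree(NameTable *t);
int NameTableAdd(NameTable *t, const char *name);
const char *NameTableGet(const NameTable *t, int id);

/* ---- What would stay private in the .c file ---- */
#define NAME_TABLE_MAX 16

struct NameTable {
    const char *names[NAME_TABLE_MAX];
    int count;
};

NameTable *NameTableNew(void)
{
    return calloc(1, sizeof(NameTable));
}

void NameTableFree(NameTable *t)
{
    free(t);
}

/* Stores the name and returns its id, or -1 if the table is full.
   The pointer is borrowed, not copied. */
int NameTableAdd(NameTable *t, const char *name)
{
    assert(t != NULL && name != NULL);
    if (t->count >= NAME_TABLE_MAX)
        return -1;
    t->names[t->count] = name;
    return t->count++;
}

/* Returns the name for an id, or NULL if the id is out of range. */
const char *NameTableGet(const NameTable *t, int id)
{
    assert(t != NULL);
    return (id >= 0 && id < t->count) ? t->names[id] : NULL;
}
```

If a caller needs something the functions do not offer, adding one more function to the header keeps the layout free to change later without touching any calling code.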

—
Dr Peter M. Kelly
pmkelly@apache.org

PGP key: http://www.kellypmk.net/pgp-key <http://www.kellypmk.net/pgp-key>
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)