You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@corinthia.apache.org by Gabriela Gibson <ga...@gmail.com> on 2015/05/10 03:52:24 UTC

Branch"odf-filter-attempt2": review of approach needed

Hi,

So far I got my branch to produce a list of html nodes (and report on
still missing stuff).

This is probably a good point to have a look if the approach I'm using
here is any good.

It of course has quite a few warts still, and I think I will need to
add function pointers to the ODF_to_HTML_key struct to deal with some
special cases.  If that struct is a good idea that is.

The branch can be found here:

https://github.com/apache/incubator-corinthia/commit/c81e68626489b9515e7e8f3a5ce5d38ac8f59af0

I added the test odt file I was using, plus the current output of the program.

thanks for looking,

G

-- 
Visit my Coding Diary: http://gabriela-gibson.blogspot.com/

Re: Branch"odf-filter-attempt2": review of approach needed

Posted by Gabriela Gibson <ga...@gmail.com>.

All fixed up now, and committed.

Hmm, mail on my mobile phone didn't show up my other reply.  The screen on
those little things definitely is too small.  Or, it was too early in the
morning.

G

On Mon, May 11, 2015 at 5:16 AM, Gabriela Gibson <ga...@gmail.com>
wrote:

> I have changed a couple of things in the meanwhile but for some reason my
> answer to Jan didnt make it to the list (mobile phone victim).
>
> I trimmed struct ODFkey_to_HTMLkey down and added a function pointer
> instead.
>
> Will ship the improved version with tidied code in the morning,  ie in
> about 5 hours.
>
> G
>
> Ps.: check my blog for a funny bug :-D
>  > On 11 May 2015, at 4:33 am, jan i <ja...@apache.org> wrote:
> >
> > Hi
> >
> > Branch looks good, I would like to hear if Peter has any comments on the
> > XML setup, otherwise I suggest to merge it to master (I assume you
> continue
> > in the branch)
>
> I’ll have a look at it this evening and get back shortly.
>
> —
> Dr Peter M. Kelly
> pmkelly@apache.org
>
> PGP key: http://www.kellypmk.net/pgp-key <http://www.kellypmk.net/pgp-key>
> (fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)
>
>


-- 
Visit my Coding Diary: http://gabriela-gibson.blogspot.com/

Re: Branch"odf-filter-attempt2": review of approach needed

Posted by Gabriela Gibson <ga...@gmail.com>.

I have changed a couple of things in the meanwhile but for some reason my
answer to Jan didnt make it to the list (mobile phone victim).

I trimmed struct ODFkey_to_HTMLkey down and added a function pointer
instead.

Will ship the improved version with tidied code in the morning,  ie in
about 5 hours.

G

Ps.: check my blog for a funny bug :-D
 > On 11 May 2015, at 4:33 am, jan i <ja...@apache.org> wrote:
>
> Hi
>
> Branch looks good, I would like to hear if Peter has any comments on the
> XML setup, otherwise I suggest to merge it to master (I assume you
continue
> in the branch)

I’ll have a look at it this evening and get back shortly.

—
Dr Peter M. Kelly
pmkelly@apache.org

PGP key: http://www.kellypmk.net/pgp-key <http://www.kellypmk.net/pgp-key>
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)

Re: Branch"odf-filter-attempt2": review of approach needed

Posted by Peter Kelly <pm...@apache.org>.

> On 11 May 2015, at 4:33 am, jan i <ja...@apache.org> wrote:
> 
> Hi
> 
> Branch looks good, I would like to hear if Peter has any comments on the
> XML setup, otherwise I suggest to merge it to master (I assume you continue
> in the branch)

I’ll have a look at it this evening and get back shortly.

—
Dr Peter M. Kelly
pmkelly@apache.org

PGP key: http://www.kellypmk.net/pgp-key <http://www.kellypmk.net/pgp-key>
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)

Re: Branch"odf-filter-attempt2": review of approach needed

Posted by Gabriela Gibson <ga...@gmail.com>.

I changed the ODFkey_to_HTMLkey struct to use function ptrs where
applicable and tidied the code up, will ship update in the morning.

G
ps.: check my blog for funny bug output :-D
On 10 May 2015 22:33, "jan i" <ja...@apache.org> wrote:

> Hi
>
> Branch looks good, I would like to hear if Peter has any comments on the
> XML setup, otherwise I suggest to merge it to master (I assume you continue
> in the branch)
>
> keep up the good work
> rgds
> jan i
>
> On Sunday, May 10, 2015, Gabriela Gibson <ga...@gmail.com>
> wrote:
>
> > Hi,
> >
> > So far I got my branch to produce a list of html nodes (and report on
> > still missing stuff).
> >
> > This is probably a good point to have a look if the approach I'm using
> > here is any good.
> >
> > It of course has quite a few warts still, and I think I will need to
> > add function pointers to the ODF_to_HTML_key struct to deal with some
> > special cases.  If that struct is a good idea that is.
> >
> > The branch can be found here:
> >
> >
> >
> https://github.com/apache/incubator-corinthia/commit/c81e68626489b9515e7e8f3a5ce5d38ac8f59af0
> >
> > I added the test odt file I was using, plus the current output of the
> > program.
> >
> > thanks for looking,
> >
> > G
> >
> > --
> > Visit my Coding Diary: http://gabriela-gibson.blogspot.com/
> >
>
>
> --
> Sent from My iPad, sorry for any misspellings.
>

Re: Branch"odf-filter-attempt2": review of approach needed

Posted by jan i <ja...@apache.org>.

Hi

Branch looks good, I would like to hear if Peter has any comments on the
XML setup, otherwise I suggest to merge it to master (I assume you continue
in the branch)

keep up the good work
rgds
jan i

On Sunday, May 10, 2015, Gabriela Gibson <ga...@gmail.com> wrote:

> Hi,
>
> So far I got my branch to produce a list of html nodes (and report on
> still missing stuff).
>
> This is probably a good point to have a look if the approach I'm using
> here is any good.
>
> It of course has quite a few warts still, and I think I will need to
> add function pointers to the ODF_to_HTML_key struct to deal with some
> special cases.  If that struct is a good idea that is.
>
> The branch can be found here:
>
>
> https://github.com/apache/incubator-corinthia/commit/c81e68626489b9515e7e8f3a5ce5d38ac8f59af0
>
> I added the test odt file I was using, plus the current output of the
> program.
>
> thanks for looking,
>
> G
>
> --
> Visit my Coding Diary: http://gabriela-gibson.blogspot.com/
>


-- 
Sent from My iPad, sorry for any misspellings.

Attributes (Branch"odf-filter-attempt2" review)

Posted by Peter Kelly <pm...@apache.org>.

Ok I started writing out a response and it got quite long, so I’m going to split this into multiple emails to help separate subsequent discussion of particular aspects.

printNode:

The attrs field of the DFNode struct is an array, not a single object. It contains attrsCount elements. Accessing it in the way that printNode does is incorrect for two reasons:

1) The fact that attrs is non-NULL doesn’t necessarily mean that the node has attributes. It’s possible that it did at one point have attributes, but those were all removed. In such a case, attrsCount will be zero, but attrs will still be non-NULL. Have a look at the implementation of DFGetAttribute, DFSetAttribute, DFRemoveAttribute in DFDOM.c to see how attrs, attrsCount, and attrsAlloc are used.

2) If there are multiple attributes, this will only print the first.

To go through all the attributes and print out their tags & values, you would use a loop:

    for (int i = 0; i < n->attrsCount; i++) {
        printf("tag %u value %s\n",n->attrs[i].tag,n->attrs[i].value);
    }

You can also use DFGetAttribute if you want to already know the tag you are looking for (e.g. TEXT_OUTLINE_LEVEL). This function will return NULL if the attribute does not exist, or the attribute value if it does.

—
Dr Peter M. Kelly
pmkelly@apache.org

PGP key: http://www.kellypmk.net/pgp-key <http://www.kellypmk.net/pgp-key>
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)

> On 10 May 2015, at 8:52 am, Gabriela Gibson <ga...@gmail.com> wrote:
> 
> Hi,
> 
> So far I got my branch to produce a list of html nodes (and report on
> still missing stuff).
> 
> This is probably a good point to have a look if the approach I'm using
> here is any good.
> 
> It of course has quite a few warts still, and I think I will need to
> add function pointers to the ODF_to_HTML_key struct to deal with some
> special cases.  If that struct is a good idea that is.
> 
> The branch can be found here:
> 
> https://github.com/apache/incubator-corinthia/commit/c81e68626489b9515e7e8f3a5ce5d38ac8f59af0
> 
> I added the test odt file I was using, plus the current output of the program.
> 
> thanks for looking,
> 
> G
> 
> -- 
> Visit my Coding Diary: http://gabriela-gibson.blogspot.com/

printMissingTag (Branch"odf-filter-attempt2" review)

Posted by Peter Kelly <pm...@apache.org>.

This code:

    char *printMissingTag(Tag tag)
    {
        char *s = translateXMLEnumName[tag];
        int len = strlen(s)+14+20;
        char *r = malloc(len);
        snprintf(r, len, "%s Missing tag: %s %s",RED, s, RESET);
        return r;
    }

Is likely to be incorrect, because it uses hard-coded numbers and it took me a good while to figure out where the 14 and 20 were derived from. Getting this wrong can lead to buffer overflows, which can result in crashes or security vulnerabilities or other problems.

There’s a function called DFFormatString in DFString.h which will basically do the above for you. So you can just use the following instead:

    char *printMissingTag(Tag tag)
    {
        char *s = translateXMLEnumName[tag];
        return DFFormatString("%s Missing tag: %s %s",RED, s, RESET);
    }

Also, the fact that you’re assigning this to the value attribute of a DFNode object will lead to a memory leak, because you’re not freeing that memory anywhere. Actually all of the memory for DFNode objects and the attributes and string values they maintained is supposed to be allocated by the DFDocument’s own internal memory allocator, which, when you release the last reference to a DFDocument object, releases all the memory in one go.

There’s a function called DFSetNodeValue which should always be used to set the value of text nodes. You can see the implementation of this in DFDOM.c; it uses the document’s memory allocator to make a copy of the string, and then assign that to the node’s value. So if you were to keep the printMissingTag as it exists above, you would replace the following code in ODFText.c:

    child->value = printMissingTag(odfChild->tag);

with:

    char *value = printMissingTag(odfChild->tag);
    DFSetNodeValue(child,value);
    free(value);

Note that we free the result of printMissingTag here, as otherwise the memory will leak.

I’d also suggest a different name than “printMissingTag”, since the function itself doesn’t actually do any printing (it instead returns a string, which the caller can either print or do something else with, e.g. assigning it to a text node value).

Finally, while the following line:

    child = DFCreateChildElement(htmlNode, 0);

creates an invalid node; you’re using 0 as the tag. Actually you’re creating a text node here, not an element, so DFCreateChildTextNode would be the appropriate function to use here instead. Combining this with the above, you could replace the following two lines:

    child = DFCreateChildElement(htmlNode, 0);
    child->value = printMissingTag(odfChild->tag);

with:

    char *value = printMissingTag(odfChild->tag);
    child = DFCreateChildTextNode(htmlNode,value);
    free(value);

—
Dr Peter M. Kelly
pmkelly@apache.org

PGP key: http://www.kellypmk.net/pgp-key <http://www.kellypmk.net/pgp-key>
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)

> On 10 May 2015, at 8:52 am, Gabriela Gibson <ga...@gmail.com> wrote:
> 
> Hi,
> 
> So far I got my branch to produce a list of html nodes (and report on
> still missing stuff).
> 
> This is probably a good point to have a look if the approach I'm using
> here is any good.
> 
> It of course has quite a few warts still, and I think I will need to
> add function pointers to the ODF_to_HTML_key struct to deal with some
> special cases.  If that struct is a good idea that is.
> 
> The branch can be found here:
> 
> https://github.com/apache/incubator-corinthia/commit/c81e68626489b9515e7e8f3a5ce5d38ac8f59af0
> 
> I added the test odt file I was using, plus the current output of the program.
> 
> thanks for looking,
> 
> G
> 
> -- 
> Visit my Coding Diary: http://gabriela-gibson.blogspot.com/

Tags and names (Branch"odf-filter-attempt2" review)

Posted by Peter Kelly <pm...@apache.org>.

The translateXMLEnumName array is unnecessary; there’s already a similar array in DFXMLNames.c. If you just want to get the name of a node (without concern to the namespace), you can use DFNodeName.

More generally, I should explain what the tags are for and how they’re used. With namespaces in XML, the XML file defines a set of prefix/URI mappings either at the top of the file or on individual elements. This is a real pain to deal with because different prefixes can refer to different URIs throughout the document, and what an application is almost always interested in is the (namespace URI, localName) pair; the prefix is purely a convenience mechanism for human readers of the XML data and to reduce file size.

So I came up with the Tag mechanism in which each (namespace URI, localName) pair has a unique integer value, and that gets recorded for each node regardless of the prefix in use. This is done in the parsing code in DFXML.c, and makes use of a DFNameMap structure to keep track of the associations between:

1) namespace URIs (strings) and namespace IDs (unsigned 32-bit integers), and
2) local names (strings) and tags (unsigned 32-bit integers)

All the namespace IDs and tags we’re likely to use are hard-coded in DFXMLNamespaces.h and DFXMLNamespaces.h, so we can refer to them from code and use them in switch statements (which only allow constants for the cases). However if any other namespaces or local names are found during parsing, then the parser will generate new, unallocated integer values for them.

Given a particular tag, there are functions available to get both the namespace URI and local name that tag represents. For example, the tag HTML_H1 has the namespace URI "http://www.w3.org/1999/xhtml <http://www.w3.org/1999/xhtml>” and the local name “h1”. Suppose you have a node n with this tag. If you do the following, you will see these two string values:

    DFNameMap *map = n->doc->map;
    const TagDecl *td = DFNameMapNameForTag(map,n->tag);
    const NamespaceDecl *ns = DFNameMapNamespaceForID(map,td->namespaceID);
    printf("namespace URI = %s\n",ns->namespaceURI);
    printf("local name = %s\n",td->localName);

output:

    namespace URI = http://www.w3.org/1999/xhtml
    local name = body

I’ll go through this step by step:

1. DFNameMap *map = n->doc->map;

This gets a reference to the name map used by the document. Each DFDocument object has a separate map and any tags and namespaceIDs that are not hard-coded must be interpreted within the context of this map. We don’t actually need this as a separate variable, we could just refer to n->doc->map in the following two lines:

2. const TagDecl *td = DFNameMapNameForTag(map,n->tag);

This gets the tag declaration, which consists of a namespace id and a string name. TagDecl is defined in DFXMLNames.h as follows:

typedef struct {
    unsigned int namespaceID;
    const char *localName;
} TagDecl;

3. const NamespaceDecl *ns = DFNameMapNamespaceForID(map,td->namespaceID);

This gets the namespace declaration, which consists of the namespace URI and the a prefix (if there were multiple prefixes used in the input file, this will be the first; for a new document, this will be the default prefix defined in DFXMLNamespace.c)

NamespaceDecl is defined in DFXMLNamespaces.h as follows:

typedef struct {
    const char *namespaceURI;
    const char *prefix;
} NamespaceDecl;

So that’s how we get from Tag -> (localName, namespaceID) -> (namespaceURI, prefix).

Did I mention that XML namespaces suck? ;) This is why everyone uses JSON these days when building web APIs. But I digress...

So that’s more than you probably wanted to know about how DocFormats tries to cover over this tragic design choice.  It’s not so bad though, as there’s some convenience functions which do the above for you:

If you have a tag, you can just do:

    printf("namespace URI = %s\n",DFTagURI(n->doc,n->tag));
    printf("local name = %s\n",DFTagName(n->doc,n->tag));

Even simpler, if you have a node, you can do:

    printf("namespace URI = %s\n",DFNodeURI(n));
    printf("local name = %s\n",DFNodeName(n));

The above two lines of code are all you need to get a string representation of the namespace URI and local name; all the DFNameMap stuff is hidden away inside these functions (you can have a look at them in DFDOM.c to see how they work; it’s basically what I described above).

Having said all this, the only time it should be necessary to actually look at the string representation of tag names or namespaces is for debugging purposes. One of the motivations behind using integers for representing these, in addition to abstracting over the whole prefix mapping mess, was to avoid the need for string comparisons, thus leading to better performance (which is mainly an issue where you want to test against 20+ possibilities in a loop).

—
Dr Peter M. Kelly
pmkelly@apache.org

PGP key: http://www.kellypmk.net/pgp-key <http://www.kellypmk.net/pgp-key>
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)

> On 10 May 2015, at 8:52 am, Gabriela Gibson <ga...@gmail.com> wrote:
> 
> Hi,
> 
> So far I got my branch to produce a list of html nodes (and report on
> still missing stuff).
> 
> This is probably a good point to have a look if the approach I'm using
> here is any good.
> 
> It of course has quite a few warts still, and I think I will need to
> add function pointers to the ODF_to_HTML_key struct to deal with some
> special cases.  If that struct is a good idea that is.
> 
> The branch can be found here:
> 
> https://github.com/apache/incubator-corinthia/commit/c81e68626489b9515e7e8f3a5ce5d38ac8f59af0
> 
> I added the test odt file I was using, plus the current output of the program.
> 
> thanks for looking,
> 
> G
> 
> -- 
> Visit my Coding Diary: http://gabriela-gibson.blogspot.com/

ODF_to_HTML_keys (Branch"odf-filter-attempt2" review)

Posted by Peter Kelly <pm...@apache.org>.

For mapping between ODF and HTML keys, I would suggest starting off with a more traditional/direct approach, where you have either a series of if statements or, equivalently, a switch statement, which checks what the current ODF element in the traversal is, and then goes ahead and creates the appropriate HTML node. So in traverseContent in ODFText.c you could have a switch statement there (remember the breaks at the end each case! - and also { } inside each case for scoping purposes).

I noticed also that in traverseContent there’s this line:

    if (odfChild->tag == 2) { // we have some text here.

I advise against using “magic numbers” like this, because it’s not at all clear what the two means (well, actually your comment makes it clear). But whenever you’re about to write a specific number, the question to ask is can you define a macro or constant whose name matches what the number means.

In fact in DFDOM.h there are the following macros defined:

    #define DOM_DOCUMENT                 1
    #define DOM_TEXT                     2
    #define DOM_COMMENT                  3
    #define DOM_CDATA                    4
    #define DOM_PROCESSING_INSTRUCTION   5

So you could change the line above to:

    if (odfChild->tag == DOM_TEXT) {

and then that makes the code self-describing, removing the need for the comment. Also, if for some reason the specific integer value used for text node was ever changed, then this code would still work correctly as long as the macro was updated. While the DOM_ numbers above are extremely unlikely to ever change, the other pre-defined constants (actually enums) like HTML_H1 defined in DFXMLNames.h are almost certain to change (when the file is re-generated from the script that assigns these numbers when someone adds some new names). So you should always use the symbolic names rather than writing out the numbers directly.

Despite my suggestion of starting with if statements or a switch statement in the traversal to begin with, I like where you’re going conceptually with the idea of representing the information necessary for translation in a data structure rather than code. In fact, whether or not this was a conscious thing or not, you’ve taken the very first step in designing your own domain-specific programming language for expressing transformations on XML data. Code that uses the data structure you define constitutes an interpreter for this language, and the sophistication of both the data structure and the interpreter can be expanded over time to cater for more complex needs. This is something I hope we can explore a lot further, and I’ve got a lot of ideas on this I’ve been thinking about for quite a while now.

—
Dr Peter M. Kelly
pmkelly@apache.org

PGP key: http://www.kellypmk.net/pgp-key <http://www.kellypmk.net/pgp-key>
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)

> On 10 May 2015, at 8:52 am, Gabriela Gibson <ga...@gmail.com> wrote:
> 
> Hi,
> 
> So far I got my branch to produce a list of html nodes (and report on
> still missing stuff).
> 
> This is probably a good point to have a look if the approach I'm using
> here is any good.
> 
> It of course has quite a few warts still, and I think I will need to
> add function pointers to the ODF_to_HTML_key struct to deal with some
> special cases.  If that struct is a good idea that is.
> 
> The branch can be found here:
> 
> https://github.com/apache/incubator-corinthia/commit/c81e68626489b9515e7e8f3a5ce5d38ac8f59af0
> 
> I added the test odt file I was using, plus the current output of the program.
> 
> thanks for looking,
> 
> G
> 
> -- 
> Visit my Coding Diary: http://gabriela-gibson.blogspot.com/