Posted to dev@corinthia.apache.org by "jan iversen (JIRA)" <ji...@apache.org> on 2015/01/18 10:30:34 UTC

[jira] [Updated] (COR-20) Write an XML/HTML parser

     [ https://issues.apache.org/jira/browse/COR-20?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jan iversen updated COR-20:
---------------------------
    Component/s: DocFormats - platform
                 DocFormats - core

> Write an XML/HTML parser
> ------------------------
>
>                 Key: COR-20
>                 URL: https://issues.apache.org/jira/browse/COR-20
>             Project: Corinthia
>          Issue Type: Improvement
>          Components: DocFormats - core, DocFormats - platform
>            Reporter: Peter Kelly
>             Fix For: 0.5
>
>
> Currently we rely on libxml2 and HTML Tidy for parsing XML and HTML, respectively. In both cases we use only the parsing functions of these libraries, not other features such as the DOM tree.
> Parsing XML is not very difficult. HTML is slightly harder because of the ambiguities that arose from the poorly defined parsing rules in earlier versions of the spec ("make a best effort" became "replicate what Internet Explorer does" because almost every site violated the rules). However, the HTML5 spec now defines a proper parsing algorithm that deals with these ambiguities. We'll also need to take into account which tags must have a corresponding close tag and which do not.
> Having our own parser will simplify dependencies a lot, particularly with regard to the somewhat awkward HTML Tidy.
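
As a rough illustration of the close-tag point in the description, here is a minimal sketch in C of how a parser might decide whether an element never takes a close tag (a "void element" in HTML spec terms). The function name html_is_void_element and the helper ascii_iequal are hypothetical and not part of the Corinthia code base. The element list follows the HTML Living Standard as I understand it; older HTML5 drafts also listed elements such as param and keygen, so the table should be checked against whichever spec version is targeted. Elements whose close tag is merely optional (p, li and so on) are a separate case that this sketch does not cover.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Assumed list of void elements (no close tag, no content).
 * Verify against the spec version the parser is meant to follow. */
static const char *const VOID_ELEMENTS[] = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr"
};

/* ASCII case-insensitive string equality; HTML tag names are
 * matched without regard to case. Returns 1 if equal, 0 otherwise. */
static int ascii_iequal(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    /* Equal only if both strings ended at the same position. */
    return *a == '\0' && *b == '\0';
}

/* Hypothetical helper: returns 1 if the tag name is a void element,
 * meaning the parser should not expect a matching close tag. */
int html_is_void_element(const char *name)
{
    size_t count = sizeof(VOID_ELEMENTS) / sizeof(VOID_ELEMENTS[0]);
    size_t i;

    assert(name != NULL);
    for (i = 0; i < count; i++) {
        if (ascii_iequal(name, VOID_ELEMENTS[i]))
            return 1;
    }
    return 0;
}
```

A tree builder could call html_is_void_element("br") right after reading a start tag and, when it returns 1, skip pushing the element onto the stack of open elements. A linear scan is used here only for brevity; a sorted table or a perfect hash would be equally reasonable.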



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)