Posted to dev@tika.apache.org by "Uwe Schindler (JIRA)" <ji...@apache.org> on 2008/11/15 18:29:45 UTC

[jira] Updated: (TIKA-172) New Open Document Parser that emits structured XHTML content.

     [ https://issues.apache.org/jira/browse/TIKA-172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated TIKA-172:
-------------------------------

    Attachment: TIKA-172.patch

patch for ODF support

> New Open Document Parser that emits structured XHTML content.
> -------------------------------------------------------------
>
>                 Key: TIKA-172
>                 URL: https://issues.apache.org/jira/browse/TIKA-172
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.2-incubating
>            Reporter: Uwe Schindler
>         Attachments: TIKA-172.patch
>
>
> The current Open Document parser is very simplistic. It only creates a single paragraph containing the whole text content of the ODF document. Another problem is that all whitespace is stripped.
> The attached patch adds a new SAX-based (and therefore low-memory) parser for ODF that needs no external libraries. The structure of ODF content.xml files is very clean (and identical for all document types) and maps very well to XHTML. Paragraphs can be mapped to <p> tags and headings to <hX> tags. Tables (and therefore spreadsheets) also follow the same rules as HTML tables.
> The idea behind this parser is a simple tag mapping approach. A new ContentHandlerDecorator in the o.a.t.sax package maps element names and attributes using a Map<javax.xml.namespace.QName,...>. For each element mapping, a second Map<javax.xml.namespace.QName,javax.xml.namespace.QName> maps its attributes. All attributes without a mapping are thrown away. Elements whose names are not in the mapping are likewise not reported to the delegate.
> With this new decorator, all ODF content.xml names can be mapped to XHTML using a static map in the parser class. In addition, the SAX handler that receives the parsing events (it extends ElementMappingContentHandler) does some extra handling for special cases in ODF:
> a) Only the direct content of tags from the text: namespace is reported to characters(); this excludes style tags and so on.
> b) Some tags and *all* of their content are left out (templates for the TOC, additional cells for col/rowspan handling).
> c) <text:h> is mapped to the HTML <hX> tags using the heading level (which ODF stores in an attribute of <text:h>).
> As there are still some OpenOffice 1.0 documents around (.sxw files) that use old namespace declarations in meta.xml and content.xml (the current parser fails to parse the metadata and content of such documents), an additional ContentHandlerDecorator maps all old namespaces beginning with "http://openoffice.org/2000/" to the "urn:oasis..." ones.
> If support for such old document types is not needed, we could simply leave out this additional decorator.
> This is a very clean and well-working approach for ODF files. In my opinion, the same could be done in a similar way for the OpenXML files of MS Office 2007. I looked into the new POI version, which has text extraction support for OpenXML, but it uses a lot of additional XML parser libraries and DOM trees, does not use SAX, and is memory intensive. I will read the specs from Microsoft in the next few days and may create the same infrastructure for OpenXML, too. As POI is meant for the OLE2 document format, it should only be used for that and not for the XML-based OpenXML.
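
For readers who want to picture the tag mapping approach described in the issue, here is a minimal sketch using only the JDK's SAX classes. It is not the code from TIKA-172.patch. The class name ElementMappingSketch, the Target helper type, the normalizeNamespace() and convert() methods and the two sample mappings (text:p to p, text:a with xlink:href to a with href) are all hypothetical. The rewrite rule for old namespaces (old prefix replaced by "urn:oasis:names:tc:opendocument:xmlns:", with ":1.0" appended) is an assumption about how the "urn:oasis..." names are formed. The special cases a) to c) from the description are not implemented; characters() is simply forwarded.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Illustrative only: every name in this class is hypothetical, not taken from the patch. */
class ElementMappingSketch extends DefaultHandler {
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";
    static final String OLD_NS_PREFIX = "http://openoffice.org/2000/";

    /** A target element name plus the attribute renames allowed on that element. */
    static final class Target {
        final QName name;
        final Map<QName, QName> attributes;
        Target(QName name, Map<QName, QName> attributes) {
            this.name = name;
            this.attributes = attributes;
        }
    }

    private final ContentHandler delegate;
    private final Map<QName, Target> mappings;

    ElementMappingSketch(ContentHandler delegate, Map<QName, Target> mappings) {
        this.delegate = delegate;
        this.mappings = mappings;
    }

    /** Assumed rule: ".../2000/text" becomes "urn:oasis:names:tc:opendocument:xmlns:text:1.0". */
    static String normalizeNamespace(String uri) {
        if (uri.startsWith(OLD_NS_PREFIX)) {
            return "urn:oasis:names:tc:opendocument:xmlns:" + uri.substring(OLD_NS_PREFIX.length()) + ":1.0";
        }
        return uri;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) throws SAXException {
        Target target = mappings.get(new QName(normalizeNamespace(uri), local));
        if (target == null) {
            return; // unmapped element: the delegate never sees it
        }
        AttributesImpl mapped = new AttributesImpl();
        for (int i = 0; i < atts.getLength(); i++) {
            QName from = new QName(normalizeNamespace(atts.getURI(i)), atts.getLocalName(i));
            QName to = target.attributes.get(from);
            if (to != null) { // attributes without a mapping are dropped
                mapped.addAttribute(to.getNamespaceURI(), to.getLocalPart(), to.getLocalPart(),
                        atts.getType(i), atts.getValue(i));
            }
        }
        delegate.startElement(target.name.getNamespaceURI(), target.name.getLocalPart(),
                target.name.getLocalPart(), mapped);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        Target target = mappings.get(new QName(normalizeNamespace(uri), local));
        if (target != null) {
            delegate.endElement(target.name.getNamespaceURI(), target.name.getLocalPart(),
                    target.name.getLocalPart());
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        delegate.characters(ch, start, length); // simplification: all text is forwarded
    }

    /** Parses ODF-like XML and returns the mapped SAX events serialized as a string. */
    static String convert(String xml) throws Exception {
        final StringBuilder out = new StringBuilder();
        ContentHandler collector = new DefaultHandler() {
            @Override public void startElement(String u, String l, String q, Attributes a) {
                out.append('<').append(l);
                for (int i = 0; i < a.getLength(); i++) {
                    out.append(' ').append(a.getLocalName(i)).append("=\"").append(a.getValue(i)).append('"');
                }
                out.append('>');
            }
            @Override public void endElement(String u, String l, String q) {
                out.append("</").append(l).append('>');
            }
            @Override public void characters(char[] ch, int s, int n) {
                out.append(ch, s, n);
            }
        };
        Map<QName, Target> map = new HashMap<>();
        map.put(new QName(TEXT_NS, "p"),
                new Target(new QName(XHTML_NS, "p"), Collections.<QName, QName>emptyMap()));
        map.put(new QName(TEXT_NS, "a"),
                new Target(new QName(XHTML_NS, "a"),
                        Collections.singletonMap(new QName(XLINK_NS, "href"), new QName("href"))));
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so that URIs and local names are reported
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setContentHandler(new ElementMappingSketch(collector, map));
        reader.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }
}
```

Calling ElementMappingSketch.convert() on a <text:p> that contains a <text:span> wrapping a <text:a xlink:href="..."> should yield a <p> with an <a href="..."> inside it: the span tags and the style attributes disappear while their text is kept. Folding the namespace rewrite into the same handler is a shortcut for brevity; the issue describes it as a separate decorator that can be left out.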

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.