Posted to dev@tika.apache.org by "Uwe Schindler (JIRA)" <ji...@apache.org> on 2008/11/15 18:27:44 UTC

[jira] Created: (TIKA-172) New Open Document Parser that emits structured XHTML content.

New Open Document Parser that emits structured XHTML content.
-------------------------------------------------------------

                 Key: TIKA-172
                 URL: https://issues.apache.org/jira/browse/TIKA-172
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 0.2-incubating
            Reporter: Uwe Schindler


The current Open Document parser is very simplistic: it creates a single paragraph containing the whole text content of the ODF document. Another problem is that all whitespace is stripped.

The attached patch adds a new SAX-based parser, which keeps memory use low and needs no external libraries for ODF. The structure of ODF content.xml files is very clean (and identical for all document types) and maps very well to XHTML. Paragraphs map to <p> tags and headings to <hX> tags. Tables (and therefore spreadsheets) follow the same structure as HTML tables.

The idea behind this parser is a simple tag-mapping approach. A new ContentHandlerDecorator in the o.a.t.sax package maps element names and attributes using a Map<javax.xml.namespace.QName,...>. Each element mapping carries a second map, Map<javax.xml.namespace.QName,javax.xml.namespace.QName>, for that element's attributes. Attributes without a mapping are discarded, and elements whose names are not in the mapping are not reported to the delegate.
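
The mapping idea can be sketched with plain JDK SAX classes. This is not the code from the attached patch: the class name ElementMappingSketch, the nested Target type and the constructor shape are assumptions made for illustration, and the mapping table is supplied by the caller.

```java
import java.util.Map;
import javax.xml.namespace.QName;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Illustrative only: forwards mapped elements to a delegate and drops the rest. */
class ElementMappingSketch extends DefaultHandler {

    /** Hypothetical holder: target element name plus the attribute renames allowed on it. */
    static final class Target {
        final QName name;
        final Map<QName, QName> attributes;

        Target(QName name, Map<QName, QName> attributes) {
            this.name = name;
            this.attributes = attributes;
        }
    }

    private final ContentHandler delegate;
    private final Map<QName, Target> mappings;

    ElementMappingSketch(ContentHandler delegate, Map<QName, Target> mappings) {
        this.delegate = delegate;
        this.mappings = mappings;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        Target target = mappings.get(new QName(uri, localName));
        if (target == null) {
            return; // element not in the mapping: not reported to the delegate
        }
        AttributesImpl mapped = new AttributesImpl();
        for (int i = 0; i < atts.getLength(); i++) {
            QName to = target.attributes.get(new QName(atts.getURI(i), atts.getLocalName(i)));
            if (to != null) { // attributes without a mapping are discarded
                mapped.addAttribute(to.getNamespaceURI(), to.getLocalPart(),
                        to.getLocalPart(), atts.getType(i), atts.getValue(i));
            }
        }
        delegate.startElement(target.name.getNamespaceURI(), target.name.getLocalPart(),
                target.name.getLocalPart(), mapped);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        Target target = mappings.get(new QName(uri, localName));
        if (target != null) {
            delegate.endElement(target.name.getNamespaceURI(), target.name.getLocalPart(),
                    target.name.getLocalPart());
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // The generic mapper passes text through; format-specific filtering belongs in a subclass.
        delegate.characters(ch, start, length);
    }
}
```

Because the lookup key is a QName, the same local name in two different namespaces can map to different targets, which is presumably why the maps are keyed on QName rather than on plain strings.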

With this new decorator, all ODF content.xml names can be mapped to XHTML using a static map in the parser class. In addition, the SAX handler that receives the parsing events (it extends ElementMappingContentHandler) handles some ODF special cases:
a) Only the direct content of tags from the text: namespace is reported to characters(); this excludes style tags and so on.
b) Some tags and *all* of their content are left out (templates for the TOC, additional cells for col/rowspan handling).
c) <text:h> is mapped to HTML <hX> using the heading level (stored in ODF as an attribute of <text:h>).
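
Point c) can be illustrated with a small helper. This is only a sketch: the class and method names are hypothetical, the level is passed in as the raw attribute value so the sketch does not depend on the attribute's name, and the fallback to h1 and the clamp to h6 are assumptions rather than behaviour taken from the patch.

```java
/** Illustrative helper: picks an XHTML heading element for an ODF heading level. */
final class HeadingLevelSketch {

    private HeadingLevelSketch() {
    }

    /**
     * @param levelAttribute raw value of the level attribute on text:h, may be null
     * @return "h1" to "h6"; missing or malformed levels fall back to "h1" (an assumption)
     */
    static String headingTag(String levelAttribute) {
        int level = 1;
        if (levelAttribute != null) {
            try {
                level = Integer.parseInt(levelAttribute.trim());
            } catch (NumberFormatException e) {
                level = 1; // malformed value: treat as a top-level heading
            }
        }
        // XHTML defines only h1..h6, so deeper ODF levels are clamped.
        level = Math.max(1, Math.min(6, level));
        return "h" + level;
    }
}
```

A handler built on the element mapper would call such a helper in startElement for <text:h> and remember the chosen name so that the matching endElement emits the same tag.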

As there are still some OpenOffice 1.0 documents around (.sxw files) that use old namespace declarations in meta.xml and content.xml (the current parser fails to parse the metadata and content of such documents), an additional ContentHandlerDecorator maps all old namespaces beginning with "http://openoffice.org/2000/" to the "urn:oasis..." ones.
If support for such old document types is not needed, we could simply leave out this additional decorator.
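
A namespace-rewriting decorator of this kind might look roughly as follows. It is a sketch under assumptions, not the patch itself: the class name NamespaceMappingSketch is hypothetical, and the old-to-new URI table is passed in by the caller rather than spelled out here.

```java
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Illustrative only: rewrites legacy namespace URIs before events reach the delegate. */
class NamespaceMappingSketch extends DefaultHandler {

    private final ContentHandler delegate;
    private final Map<String, String> oldToNew; // caller-supplied URI table

    NamespaceMappingSketch(ContentHandler delegate, Map<String, String> oldToNew) {
        this.delegate = delegate;
        this.oldToNew = oldToNew;
    }

    /** Returns the replacement URI, or the input unchanged when no mapping exists. */
    String map(String uri) {
        String replacement = oldToNew.get(uri);
        return replacement != null ? replacement : uri;
    }

    @Override
    public void startPrefixMapping(String prefix, String uri) throws SAXException {
        delegate.startPrefixMapping(prefix, map(uri));
    }

    @Override
    public void endPrefixMapping(String prefix) throws SAXException {
        delegate.endPrefixMapping(prefix);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // Copy the attributes so their namespace URIs can be rewritten as well.
        AttributesImpl mapped = new AttributesImpl(atts);
        for (int i = 0; i < mapped.getLength(); i++) {
            mapped.setURI(i, map(mapped.getURI(i)));
        }
        delegate.startElement(map(uri), localName, qName, mapped);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        delegate.endElement(map(uri), localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        delegate.characters(ch, start, length);
    }
}
```

Placed in front of the element mapper, such a decorator lets the static element map be written against the new namespaces only, and it can be left out of the chain when the old formats do not need to be supported.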

This is a very clean approach that works well for ODF files. In my opinion, the same could be done for OpenXML files from MS Office 2007. I looked into the new POI version, which has text extraction support for OpenXML, but it uses a lot of additional XML parser libraries and DOM trees, does not use SAX, and is memory-intensive. I will read the specs from Microsoft over the next few days, and maybe I will create the same infrastructure for OpenXML, too. As POI is for the OLE2 document format, it should only be used for that and not for the XML-based OpenXML.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-172) New Open Document Parser that emits structured XHTML content.

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated TIKA-172:
-------------------------------

    Attachment: TIKA-172.patch

Updated patch that handles spreadsheet documents better, because OpenOffice generates a lot of empty cells with a repeat attribute. These are now transformed to colspans in HTML.


[jira] Updated: (TIKA-172) New Open Document Parser that emits structured XHTML content.

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated TIKA-172:
-------------------------------

    Attachment: TIKA-172.patch

patch for ODF support


[jira] Commented: (TIKA-172) New Open Document Parser that emits structured XHTML content.

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647884#action_12647884 ] 

Uwe Schindler commented on TIKA-172:
------------------------------------

Additionally, my patch contains a MIME mapping for the AutoDetectParser for old OpenOffice 1.0 files. Mappings for the other file formats are still missing.

In my opinion, it would be good to have only one general mapping for ODF and OpenOffice 1.0 files (using a MIME type like "application/x-tika-open-document") that detects the format from the ZIP signature and the start of the MIME type only (not the complete string with the exact type). The correct MIME type is set later by the parser, when it opens the ZIP file and reads "mimetype". This is similar to the generic MIME type for MS Office. It would also make it possible to correctly detect other types of ODF documents (currently not supported by Tika directly). Currently they are detected as ZIP files, which is not correct.
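
Reading the declared type from the package could be sketched like this. It uses only java.util.zip; the class and method names are hypothetical, and the sketch assumes that "mimetype" is the first entry of the ZIP file, returning null otherwise.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Illustrative only: reads the type an ODF-style package declares about itself. */
final class OdfMimeTypeSketch {

    private OdfMimeTypeSketch() {
    }

    /**
     * @param zipStream stream positioned at the start of a ZIP file
     * @return the content of the "mimetype" entry, or null when the first
     *         entry has a different name or the stream holds no ZIP entries
     */
    static String readDeclaredMimeType(InputStream zipStream) throws IOException {
        ZipInputStream zip = new ZipInputStream(zipStream);
        ZipEntry first = zip.getNextEntry();
        if (first == null || !"mimetype".equals(first.getName())) {
            return null;
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[256];
        int n;
        while ((n = zip.read(chunk)) != -1) { // read() stops at the end of the current entry
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.US_ASCII).trim();
    }
}
```

With something like this in the parser, the detector only has to recognise the container, and the exact document type is taken from the file itself.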


[jira] Updated: (TIKA-172) New Open Document Parser that emits structured XHTML content.

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated TIKA-172:
-------------------------------

    Attachment:     (was: TIKA-172.patch)
