Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2016/12/06 14:22:58 UTC

[jira] [Commented] (TIKA-2191) Apply current .docx unit tests to experimental SAX parser and fix or document as necessary

    [ https://issues.apache.org/jira/browse/TIKA-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15725620#comment-15725620 ] 

Tim Allison commented on TIKA-2191:
-----------------------------------

I just pushed a number of fixes covering hyperlinks, <b> and <i> tags, extraction of objects embedded in headers, and handling of .docm files (to extract macros), among other things.

The SAX parser still needs:
1) application of styles
2) application of paragraph numbering
3) application of bookmarks
4) placement of footnotes closer to citation/paragraph.

> Apply current .docx unit tests to experimental SAX parser and fix or document as necessary
> ------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2191
>                 URL: https://issues.apache.org/jira/browse/TIKA-2191
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>
> Many areas need cleanup to ensure that the new SAX .docx parser yields results similar to those of the legacy DOM .docx parser. Let's use this issue to track work on improvements.


