Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2010/09/14 21:12:32 UTC

[jira] Updated: (TIKA-506) Improve doc and docx parsing to include more things

     [ https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch updated TIKA-506:
----------------------------

    Attachment: tika-word6.patch

The attached patch improves .docx parsing to include headings, hyperlinks, better text placement, bold/italic formatting, and images in the correct place.

It depends on code that has only just gone into POI svn, so it will need to wait for POI 3.7 beta 3 before it can be applied.

To spot where images go, it also needs the full ooxml schemas file, owing to some odd behaviour of xmlbeans. Hopefully we'll get this one figured out too in time for 3.7 beta 3.

> Improve doc and docx parsing to include more things
> ---------------------------------------------------
>
>                 Key: TIKA-506
>                 URL: https://issues.apache.org/jira/browse/TIKA-506
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>            Assignee: Nick Burch
>         Attachments: tika-word6.patch
>
>
> There are several parts of Word documents (.doc and .docx) that we don't currently extract, but which would be nice to have.
> These include:
> * Hyperlinks
> * Images (img tag referencing the name of the embedded image)
> * Headings (when the default heading styles are used)
> * Style information (when a style other than Default or a body style is used on a paragraph, mark up the p tag with it)
> I'm proposing to add support for these in the near future.
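
To illustrate the heading and style items in the list above, here is a rough sketch of how a paragraph's style name might be turned into an XHTML opening tag. It is not taken from the attached patch. The class name WordStyleMapper, the method openTag, the set of "plain" style names (Default, Normal, Body Text) and the rule for deriving a class attribute are all assumptions made for the example.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika or POI: maps a Word paragraph
// style name to the XHTML opening tag a parser could emit for it.
final class WordStyleMapper {
    // Default heading styles, e.g. "Heading 1" or "Heading1" (levels 1-6 only).
    private static final Pattern HEADING =
            Pattern.compile("heading\\s*([1-6])", Pattern.CASE_INSENSITIVE);

    private WordStyleMapper() {
    }

    static String openTag(String styleName) {
        // No style information: emit a plain paragraph.
        if (styleName == null || styleName.trim().isEmpty()) {
            return "<p>";
        }
        String name = styleName.trim();

        // Default heading styles become h1..h6.
        Matcher m = HEADING.matcher(name);
        if (m.matches()) {
            return "<h" + m.group(1) + ">";
        }

        // Assumption: these names count as the default or body styles.
        if (name.equalsIgnoreCase("Default")
                || name.equalsIgnoreCase("Normal")
                || name.equalsIgnoreCase("Body Text")) {
            return "<p>";
        }

        // Any other style is recorded as a class on the p tag.
        // Assumption: lower-case the name and replace other characters with "_".
        String cssClass = name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", "_");
        return "<p class=\"" + cssClass + "\">";
    }
}
```

With this sketch, a paragraph styled "Heading 2" would open with an h2 tag, a "Normal" paragraph with a bare p tag, and a paragraph styled "Intense Quote" with a p tag carrying class="intense_quote".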
