
Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2010/09/07 17:56:32 UTC

[jira] Created: (TIKA-506) Improve doc and docx parsing to include more things

Improve doc and docx parsing to include more things
---------------------------------------------------

                 Key: TIKA-506
                 URL: https://issues.apache.org/jira/browse/TIKA-506
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 0.7
            Reporter: Nick Burch
            Assignee: Nick Burch


There are several parts of Word documents (.doc and .docx) that we don't currently extract, but which would be nice to have.

These include:
* Hyperlinks
* Images (img tag referencing the name of the embedded image)
* Headings (when the default heading styles are used)
* Style information (when a paragraph uses a style other than Default or a body style, mark up the p tag with it)

I'm proposing to add support for these in the near future.
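
To make the four items above concrete, here is a minimal sketch of how a consumer might pick them out of the SAX events for an XHTML document. It uses only the JDK's SAX classes, not Tika itself. The class name WordMarkupSketch, the sample markup, and the choice of the class attribute to carry the paragraph style are all assumptions made for illustration; the XHTML that the parser ends up emitting may well differ.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: gathers links, images, headings
// and paragraph styles from SAX events for an XHTML document.
class WordMarkupSketch {

    static class Collector extends DefaultHandler {
        final List<String> links = new ArrayList<>();
        final List<String> images = new ArrayList<>();
        final List<String> headings = new ArrayList<>();
        final List<String> paragraphStyles = new ArrayList<>();
        private StringBuilder headingText; // non-null only while inside h1..h6

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (qName.equals("a") && atts.getValue("href") != null) {
                links.add(atts.getValue("href"));
            } else if (qName.equals("img") && atts.getValue("src") != null) {
                images.add(atts.getValue("src"));
            } else if (qName.matches("h[1-6]")) {
                headingText = new StringBuilder();
            } else if (qName.equals("p") && atts.getValue("class") != null) {
                // Assumption: the paragraph style name is carried in the class attribute.
                paragraphStyles.add(atts.getValue("class"));
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (headingText != null) {
                headingText.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (headingText != null && qName.matches("h[1-6]")) {
                headings.add(qName + ": " + headingText.toString().trim());
                headingText = null;
            }
        }
    }

    // Parses the given XHTML string with the JDK's default (non-namespace-aware)
    // SAX parser and returns the populated collector.
    static Collector collect(String xhtml) throws Exception {
        Collector collector = new Collector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), collector);
        return collector;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample markup, standing in for the output of a Word parser.
        String sample = "<html><body>"
                + "<h1>Report</h1>"
                + "<p class=\"signature\">See <a href=\"http://tika.apache.org/\">Tika</a></p>"
                + "<img src=\"image1.png\" alt=\"image1.png\"/>"
                + "</body></html>";
        Collector c = collect(sample);
        System.out.println("links=" + c.links);
        System.out.println("images=" + c.images);
        System.out.println("headings=" + c.headings);
        System.out.println("styles=" + c.paragraphStyles);
    }
}
```

The same Collector could in principle be handed to any parser that reports XHTML through a SAX ContentHandler, though a namespace-aware source would report element names in the local name rather than the qName, so the comparisons would need adjusting.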

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (TIKA-506) Improve doc and docx parsing to include more things

Posted by "Geoff Jarrad (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915971#action_12915971 ] 

Geoff Jarrad commented on TIKA-506:
-----------------------------------

Brilliant work, Nick! Thanks. The sample.doc runs through Tika like a dream.

Now, do you think it might be feasible to extract font colours?  Or is there currently no support from the POI side of things?

It has become crucial in my work on document analysis to be able to determine the background colour of a table cell, as well as the foreground colour of text (seems odd, I know, but that's how the document originators are encoding some information). Currently I am being forced to divert .doc documents to an OpenOffice.org service for translation to HTML, then using Tika's HtmlParser to decode that into ContentHandler events. Being so close to having a sufficient .doc parser native to Tika (courtesy of the great work of yourself and others) is both exciting and frustrating!

What are your thoughts? It's actually quite instructive to see what HTML OpenOffice.org produces from a Word document, which is why I say the OfficeParser is currently so close. Wouldn't it be amazing if, in the future, .doc, .docx and .odt versions of the same document were all parsed to the same HTML?

> Improve doc and docx parsing to include more things
> ---------------------------------------------------
>
>                 Key: TIKA-506
>                 URL: https://issues.apache.org/jira/browse/TIKA-506
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>            Assignee: Nick Burch
>             Fix For: 0.8
>
>         Attachments: sample.doc, tika-word11.patch, tika-word12.patch, tika-word6.patch, tika-word9.patch
