Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2010/09/07 17:56:32 UTC

[jira] Created: (TIKA-506) Improve doc and docx parsing to include more things

Improve doc and docx parsing to include more things
---------------------------------------------------

                 Key: TIKA-506
                 URL: https://issues.apache.org/jira/browse/TIKA-506
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 0.7
            Reporter: Nick Burch
            Assignee: Nick Burch


There are several parts of Word documents (.doc and .docx) that we don't currently extract, but which would be nice to have.

These include:
* Hyperlinks
* Images (img tag referencing the name of the embedded image)
* Headings (when the default heading styles are used)
* Style information (when a style other than Default or a body is used on a paragraph, mark up the p tag with it)

I'm proposing to add support for these in the near future

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-506) Improve doc and docx parsing to include more things

Posted by "Geoff Jarrad (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915971#action_12915971 ] 

Geoff Jarrad commented on TIKA-506:
-----------------------------------

Brilliant work, Nick! Thanks. The sample.doc runs through Tika like a dream.

Now, do you think it might be feasible to extract font colours?  Or is there currently no support from the POI side of things?

It has become crucial in my work on document analysis to be able to determine the background colour of a table cell, as well as the foreground colour of text (seems odd, I know, but that's how the document originators are encoding some information). Currently I am being forced to divert .doc documents to an OpenOffice.org service for translation to HTML, then using Tika's HtmlParser to decode that into ContentHandler events. Being so close to having a sufficient .doc parser native to Tika (courtesy of the great work of yourself and others) is both exciting and frustrating!

What are your thoughts? It's actually quite instructive to see what HTML OpenOffice.org produces from a Word document, which is why I say the OfficeParser is currently so close. Wouldn't it be amazing if, in the future, .doc, .docx and .odt versions of the same document were all parsed to the same HTML?



[jira] Commented: (TIKA-506) Improve doc and docx parsing to include more things

Posted by "Geoff Jarrad (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916738#action_12916738 ] 

Geoff Jarrad commented on TIKA-506:
-----------------------------------

I didn't quite mean that Tika should output the same level of HTML as OpenOffice.org, merely that it would be nice if Tika's OfficeParser, OpenDocumentParser and OOXMLParser could output consistent HTML for the same document content represented in different formats. Currently there are idiosyncratic differences that mean the various formats each get analysed (post HTML output) slightly differently by my code.

As for the font colours, my boss is willing to let me have some time to work on it if it proves feasible for me to do. Do you have any explicit pointers to Word and POI specs that might help me? I'm not quite sure where to start looking.



[jira] Resolved: (TIKA-506) Improve doc and docx parsing to include more things

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-506.
-----------------------------

    Fix Version/s: 0.8
       Resolution: Fixed

Latest patch applied in r1001209.



[jira] Commented: (TIKA-506) Improve doc and docx parsing to include more things

Posted by "Geoff Jarrad (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915114#action_12915114 ] 

Geoff Jarrad commented on TIKA-506:
-----------------------------------

Tables in Word documents (.doc via OfficeParser) appear to be given a <table> start element but no matching </table> end element (nor </tbody> I think).
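
One way to catch this kind of problem mechanically is to check that the SAX events pair up. The following is only a minimal sketch: `BalanceCheckingHandler` is a hypothetical helper, not part of Tika, and it compares elements by qName only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper, not part of Tika: reports start/end elements that do not pair up. */
final class BalanceCheckingHandler extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<>();
    private final List<String> problems = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!open.contains(qName)) {
            problems.add("</" + qName + "> has no matching start element");
            return;
        }
        // Anything opened after the matching start element was left unclosed.
        while (!open.peek().equals(qName)) {
            problems.add("<" + open.pop() + "> was never closed");
        }
        open.pop();
    }

    @Override
    public void endDocument() {
        // Whatever is still open at the end of the document never got an end element.
        while (!open.isEmpty()) {
            problems.add("<" + open.pop() + "> was never closed");
        }
    }

    List<String> problems() {
        return problems;
    }
}
```

Since it extends DefaultHandler, it can be passed anywhere a ContentHandler is accepted; after parsing, a non-empty `problems()` list points at the unbalanced elements.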



[jira] Commented: (TIKA-506) Improve doc and docx parsing to include more things

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914506#action_12914506 ] 

Nick Burch commented on TIKA-506:
---------------------------------

The POI 3.7 beta 3 release vote has passed. I'll apply this patch over the weekend, once the poi jars have propagated to all the maven repos.



[jira] Commented: (TIKA-506) Improve doc and docx parsing to include more things

Posted by "Geoff Jarrad (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915557#action_12915557 ] 

Geoff Jarrad commented on TIKA-506:
-----------------------------------

Good work on extracting more of .doc and .docx documents! 

Interestingly, the OOXMLParser now implements almost all of the (independent) hacks I recently made in my own version of the parser.

Some HTML normalisations for .docx documents that I used, and would be nice to have, were:

<p class="heading_1">...</p>                      -->      <h1>...</h1>
<p class="heading_2">...</p>                      -->      <h2>...</h2>
<p class="hTML_Preformatted">...</p>      -->      <pre>...</pre>
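
A minimal Java sketch of the mapping above. The class `ParagraphClassNormaliser` and its methods are hypothetical and not part of Tika; only the three class names listed above are covered, and any other class stays on a plain p element.

```java
import java.util.Map;

/** Hypothetical helper, not part of Tika: maps a paragraph class to an HTML element name. */
final class ParagraphClassNormaliser {
    // Only the three mappings suggested above; anything else falls back to "p".
    private static final Map<String, String> TAG_FOR_CLASS = Map.of(
            "heading_1", "h1",
            "heading_2", "h2",
            "hTML_Preformatted", "pre");

    /** Returns the element to emit for a paragraph with the given class. */
    static String tagFor(String cssClass) {
        if (cssClass == null) {
            return "p";
        }
        return TAG_FOR_CLASS.getOrDefault(cssClass, "p");
    }

    /** Wraps text (assumed to be escaped already) in the normalised element. */
    static String render(String cssClass, String escapedText) {
        String tag = tagFor(cssClass);
        // Keep the class attribute only when no dedicated element exists for it.
        String attrs = (tag.equals("p") && cssClass != null)
                ? " class=\"" + cssClass + "\""
                : "";
        return "<" + tag + attrs + ">" + escapedText + "</" + tag + ">";
    }
}
```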

Also, in some documents I have encountered, sometimes a text snippet is obscured by Word adding entity w:smartTag elements. This text does not get extracted by the OOXMLParser, but I'm not sure if the best fix lies in Tika, POI or Microsoft's ooxml schemas. My own hack was to reparse the DOM for paragraphs, looking for w:r elements at any depth (including within w:smartTag), but there must be a better way.
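
The reparse described above can be sketched with the JDK's DOM API alone. This is an illustration under assumptions, not Tika or POI code: `DeepRunText` is a hypothetical class, and it takes the raw document XML as a string rather than going through POI.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper, not Tika or POI code. */
final class DeepRunText {
    // WordprocessingML main namespace, conventionally bound to the "w" prefix.
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text of each w:p, taking w:r elements at any depth below it. */
    static List<String> paragraphTexts(String documentXml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getElementsByTagNameNS
            doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document XML", e);
        }

        List<String> result = new ArrayList<>();
        NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element paragraph = (Element) paragraphs.item(i);
            StringBuilder text = new StringBuilder();
            // Descendant search, in document order: finds runs whether or not
            // a w:smartTag (or any other wrapper) sits between w:p and w:r.
            NodeList runs = paragraph.getElementsByTagNameNS(W_NS, "r");
            for (int j = 0; j < runs.getLength(); j++) {
                NodeList texts = ((Element) runs.item(j)).getElementsByTagNameNS(W_NS, "t");
                for (int k = 0; k < texts.getLength(); k++) {
                    text.append(texts.item(k).getTextContent());
                }
            }
            result.add(text.toString());
        }
        return result;
    }
}
```

The descendant search is what makes the wrapper irrelevant; a child-only walk from w:p to w:r would skip any run nested inside w:smartTag.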

Finally, for the uses to which I put Tika, it would be nice for .doc documents if color font styles could be extracted, but I'm not sure if POI makes these available.





[jira] Commented: (TIKA-506) Improve doc and docx parsing to include more things

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916101#action_12916101 ] 

Nick Burch commented on TIKA-506:
---------------------------------

I'm not sure we want to end up with the same HTML as OpenOffice.org, as personally I think it contains too many tags that aren't appropriate for the HTML. (For example, my view is that in the HTML, the choice of font is one for the people writing the CSS, not something that should be copied blindly from Word.) I quite like the clean, semantically meaningful HTML we've now got!

In terms of the colours, I suspect they're available somewhere in the bowels of the character and paragraph properties. However, there's no current high-level way to get at them AFAIK. You'd probably need to grab a copy of the Word specs, figure out which fields hold them, find those in POI, decode them further, and finally write a nice user-facing access method for them.



[jira] Updated: (TIKA-506) Improve doc and docx parsing to include more things

Posted by "Geoff Jarrad (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Geoff Jarrad updated TIKA-506:
------------------------------

    Attachment: sample.doc

The attached sample.doc Word document breaks the OfficeParser:

java.util.NoSuchElementException
	at java.util.AbstractList$Itr.next(AbstractList.java:350)
	at org.apache.tika.parser.microsoft.WordExtractor$CountingIterator.next(WordExtractor.java:401)
	at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:168)
	at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:80)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:187)



[jira] Commented: (TIKA-506) Improve doc and docx parsing to include more things

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915246#action_12915246 ] 

Nick Burch commented on TIKA-506:
---------------------------------

Whoops, yes. Fixed in r1001640.



[jira] Commented: (TIKA-506) Improve doc and docx parsing to include more things

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915771#action_12915771 ] 

Nick Burch commented on TIKA-506:
---------------------------------

Wow, that proved tricky and very painful...

As of r1002190 we shouldn't give the error you've seen. We may even stand a better chance of putting the right images in the right place, but Word image stuff is even nastier than I'd previously thought it was :(



[jira] Updated: (TIKA-506) Improve doc and docx parsing to include more things

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch updated TIKA-506:
----------------------------

    Attachment: tika-word12.patch

Updated patch (v12) which tidies up a few bits when building against the POI 3.7 beta 3 release candidate jars.



[jira] Updated: (TIKA-506) Improve doc and docx parsing to include more things

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch updated TIKA-506:
----------------------------

    Attachment: tika-word9.patch

This updates yesterday's patch, and additionally includes hyperlinks and bold/italic for .doc



[jira] Commented: (TIKA-506) Improve doc and docx parsing to include more things

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915707#action_12915707 ] 

Nick Burch commented on TIKA-506:
---------------------------------

I've added "HTML Preformatted" -> pre as suggested

Your word document is a bit pesky, as it uses both floating and fixed pictures, and some of the floating pictures reference the same image data...

I'll need to do some more investigating on that



[jira] Updated: (TIKA-506) Improve doc and docx parsing to include more things

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch updated TIKA-506:
----------------------------

    Attachment: tika-word6.patch

The attached patch improves the parsing of .docx to include headings, hyperlinks, better text placement, bold/italic and images in the correct place

It needs code that has only just gone into the poi svn, so will need to wait for poi 3.7 beta 3 before being applied

To spot where images go, it also needs the full ooxml schemas file, owing to some odd behaviour of xmlbeans. Hopefully we'll get this one figured out too in time for 3.7 beta 3.



[jira] Commented: (TIKA-506) Improve doc and docx parsing to include more things

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916858#action_12916858 ] 

Nick Burch commented on TIKA-506:
---------------------------------

In terms of the 3 different formats, I'd suggest you open a new bug so we can track it there. It ought to be possible to get fairly close, but it'll certainly require someone to spend some time preparing documents and testing them...

For the colours, we should probably take the discussion to the poi dev list, as it'll almost certainly need some work on POI to expose the information before Tika could use it.



[jira] Updated: (TIKA-506) Improve doc and docx parsing to include more things

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch updated TIKA-506:
----------------------------

    Attachment: tika-word11.patch

New patch (v11) adds support for .doc images, and non-nested .doc tables (nested tables are going to need some more POI work). Now the only thing that is supported for .docx but not .doc is nested tables - in .doc the nested tables come out as regular paragraphs.

As part of this, I've had to add a new boolean option to EmbeddedDocumentExtractor, so a parser can tell it if html for the embedded resource has already been output. In most cases, the current behaviour of "no html has yet been output" will apply, and EmbeddedDocumentExtractor should output html as before. In a few cases however, the parser will have done its own markup, so we don't want the extra bits.
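
The effect of such a flag can be illustrated with a small stand-alone sketch. Everything here is an assumption made for illustration: `EmbeddedOutputSketch`, its method, the parameter name `htmlAlreadyOutput` and the wrapper markup are hypothetical, and none of it is the signature or output that the patch adds to EmbeddedDocumentExtractor.

```java
/** Hypothetical stand-in for the idea described above; not Tika's EmbeddedDocumentExtractor. */
final class EmbeddedOutputSketch {
    /**
     * @param resourceName      name of the embedded resource (assumed to be escaped already)
     * @param embeddedText      content produced for the embedded resource (assumed escaped)
     * @param htmlAlreadyOutput true when the calling parser has written its own markup
     */
    static String emit(String resourceName, String embeddedText, boolean htmlAlreadyOutput) {
        StringBuilder out = new StringBuilder();
        if (!htmlAlreadyOutput) {
            // Default case: nothing has been written for this resource yet,
            // so the extractor adds its own wrapper (invented markup).
            out.append("<div class=\"embedded\"><h1>").append(resourceName).append("</h1>");
        }
        out.append(embeddedText);
        if (!htmlAlreadyOutput) {
            out.append("</div>");
        }
        return out.toString();
    }
}
```

With the flag set, a parser that has already written an img tag for a picture gets back only the embedded content, with no second wrapper around it.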

The patch needs poi 3.7 beta 3, so can't be applied until that has been released
