Posted to dev@tika.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2012/07/04 01:39:33 UTC

[jira] [Updated] (TIKA-948) Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc

     [ https://issues.apache.org/jira/browse/TIKA-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-948:
------------------------------------

    Attachment: TIKA-948.patch
                EmbeddedPDF.doc

Here's a trivial test document plus a test case showing the issue. If you
run TikaCLI -z on it, the embedded file is extracted as _1402837031.wps,
but it really should be a PDF.

I traced this down a bit, into AbstractPOIFSExtractor, where it calls
POIFSDocumentType.detectType(dir) and that (incorrectly) returns WPS.

I think the logic in POIFSContainerDetector.detect (which guesses the
embedded file's type by looking at the directory listing of the
document node) is too simplistic.  We may need to peek into the
\0001CompObj contents to get the true document type (I can see, using
POI's POIFSViewer, that this seems to identify the MediaType of the
file, and processStarDrawOrImpress already does so...).

But I don't know the format of the bytes in \0001CompObj.

Or, alternatively, we could pull the CONTENTS bytes and auto-detect on
those.  Basically, we somehow need to determine whether the embedded
object is another office format (and do what we do now); otherwise we
pull the CONTENTS bytes and recurse on only those.
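
To make the second idea concrete, here is a rough sketch of what "pull the
CONTENTS bytes and auto-detect on that" might look like. It is not Tika or
POI code: the class EmbeddedTypeSniffer, its methods, and the Map standing
in for the document node's entries are all hypothetical, and it only
recognizes PDF by its leading "%PDF-" signature. A real fix would
presumably read the entry through POI and hand the bytes to Tika's normal
detector instead.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical helper, for illustration only (not part of Tika or POI).
final class EmbeddedTypeSniffer {

    // PDF files begin with the ASCII signature "%PDF-".
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the given bytes start with the PDF signature.
    static boolean startsWithPdfMagic(byte[] data) {
        if (data == null || data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // entries: assumed stand-in for the embedded object's directory,
    //          mapping entry name to that entry's bytes.
    // listingGuess: the type guessed from the directory listing alone.
    // If a CONTENTS entry holds a PDF, prefer that over the listing guess;
    // otherwise keep the existing behaviour.
    static String guessType(Map<String, byte[]> entries, String listingGuess) {
        byte[] contents = entries.get("CONTENTS");
        if (startsWithPdfMagic(contents)) {
            return "application/pdf";
        }
        return listingGuess;
    }
}
```

With something like this, a node whose CONTENTS entry starts with "%PDF-"
would be reported as application/pdf even when the directory listing alone
suggests WPS, and anything else would fall through to the current logic.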

                
> Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc
> -------------------------------------------------------------------------
>
>                 Key: TIKA-948
>                 URL: https://issues.apache.org/jira/browse/TIKA-948
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: EmbeddedPDF.doc, TIKA-948.patch
>
>
> This is just like TIKA-704, except that issue was for an OOXML Word
> doc but this is for the older Word 97-2003 format.
