You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@tika.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2012/07/04 01:32:34 UTC

[jira] [Created] (TIKA-948) Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc

Michael McCandless created TIKA-948:
---------------------------------------

             Summary: Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc
                 Key: TIKA-948
                 URL: https://issues.apache.org/jira/browse/TIKA-948
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Michael McCandless
            Priority: Minor


This is just like TIKA-704, except that issue was for an OOXML Word
doc but this is for the older Word 97-2003 format.



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-948) Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc

Posted by "Alex Ott (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13408180#comment-13408180 ] 

Alex Ott commented on TIKA-948:
-------------------------------

Maybe you also reuse information from prop stream nearby of CompObj?
                
> Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc
> -------------------------------------------------------------------------
>
>                 Key: TIKA-948
>                 URL: https://issues.apache.org/jira/browse/TIKA-948
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: EmbeddedPDF.doc, TIKA-948.patch, TIKA-948.patch
>
>
> This is just like TIKA-704, except that issue was for an OOXML Word
> doc but this is for the older Word 97-2003 format.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Resolved] (TIKA-948) Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved TIKA-948.
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.2

I think Nick's commits fixed this.
                
> Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc
> -------------------------------------------------------------------------
>
>                 Key: TIKA-948
>                 URL: https://issues.apache.org/jira/browse/TIKA-948
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 1.2
>
>         Attachments: EmbeddedPDF.doc, TIKA-948.patch, TIKA-948.patch
>
>
> This is just like TIKA-704, except that issue was for an OOXML Word
> doc but this is for the older Word 97-2003 format.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-948) Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409094#comment-13409094 ] 

Michael McCandless commented on TIKA-948:
-----------------------------------------

bq. However, it doesn't look like it stores the file type as such, just the application the data should be passed to, which might well not be specific enough

Ahh, that's a good point...

bq. The Quill96 thing comes from an old works file, from an old bug report somewhere. IIRC we didn't get a sample file to go with it, but it might be good for someone to work out which mailing list discussion / old issue it's associated with to check...

Oh in fact it looks like we do have test coverage for this (TestContainerAwareDetector.testDetectOLE2), so all is good.
                
> Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc
> -------------------------------------------------------------------------
>
>                 Key: TIKA-948
>                 URL: https://issues.apache.org/jira/browse/TIKA-948
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: EmbeddedPDF.doc, TIKA-948.patch, TIKA-948.patch
>
>
> This is just like TIKA-704, except that issue was for an OOXML Word
> doc but this is for the older Word 97-2003 format.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-948) Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409024#comment-13409024 ] 

Nick Burch commented on TIKA-948:
---------------------------------

If someone feels keen, we could add CompObj decoding. When that's there, we could try to extract helpful data (such as file type) out. However, it doesn't look like it stores the file type as such, just the application the data should be passed to, which might well not be specific enough

The Quill96 thing comes from an old works file, from an old bug report somewhere. IIRC we didn't get a sample file to go with it, but it might be good for someone to work out which mailing list discussion / old issue it's associated with to check...
                
> Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc
> -------------------------------------------------------------------------
>
>                 Key: TIKA-948
>                 URL: https://issues.apache.org/jira/browse/TIKA-948
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: EmbeddedPDF.doc, TIKA-948.patch, TIKA-948.patch
>
>
> This is just like TIKA-704, except that issue was for an OOXML Word
> doc but this is for the older Word 97-2003 format.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Assigned] (TIKA-948) Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned TIKA-948:
---------------------------------------

    Assignee: Michael McCandless
    
> Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc
> -------------------------------------------------------------------------
>
>                 Key: TIKA-948
>                 URL: https://issues.apache.org/jira/browse/TIKA-948
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: EmbeddedPDF.doc, TIKA-948.patch
>
>
> This is just like TIKA-704, except that issue was for an OOXML Word
> doc but this is for the older Word 97-2003 format.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Comment Edited] (TIKA-948) Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc

Posted by "Alex Ott (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13408180#comment-13408180 ] 

Alex Ott edited comment on TIKA-948 at 7/6/12 5:49 PM:
-------------------------------------------------------

Maybe you also reuse information from prop stream nearby of CompObj?

P.S. Format of CompObj is described in MS's documentation, as I remember, there are 2 flavors of them, with different offsets of actual data
                
      was (Author: alexott):
    Maybe you also reuse information from prop stream nearby of CompObj?
                  
> Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc
> -------------------------------------------------------------------------
>
>                 Key: TIKA-948
>                 URL: https://issues.apache.org/jira/browse/TIKA-948
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: EmbeddedPDF.doc, TIKA-948.patch, TIKA-948.patch
>
>
> This is just like TIKA-704, except that issue was for an OOXML Word
> doc but this is for the older Word 97-2003 format.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-948) Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409002#comment-13409002 ] 

Michael McCandless commented on TIKA-948:
-----------------------------------------

Thanks for taking this Nick!  Can you add a CHANGES entry for it?

It seems odd not using the explicitly declared file type inside the CompObj and running detection on the byte[] CONTENTS instead, but it is definitely better than my original patch since it can handle all formats we can detect (not just PDF), so +1.  We can separately (later) decode CompObj if that ever seems necessary...

Also: do we have any test data for that Quill96 (that I almost broke!)?  Would be nice to have test coverage for it...
                
> Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc
> -------------------------------------------------------------------------
>
>                 Key: TIKA-948
>                 URL: https://issues.apache.org/jira/browse/TIKA-948
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: EmbeddedPDF.doc, TIKA-948.patch, TIKA-948.patch
>
>
> This is just like TIKA-704, except that issue was for an OOXML Word
> doc but this is for the older Word 97-2003 format.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (TIKA-948) Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-948:
------------------------------------

    Attachment: TIKA-948.patch

Patch... I think it's ready.

I fixed POIFSContainerDetector to also look for PDF, by loading the
CompObj bytes if present (I refactored the Star Impress/Draw code that
was already doing so), and fixed AbstractPOIFSExtractor to handle
non-office embedded documents (well, only PDF for now).

It would good to fix all other non-office document types
that may be embedded ... but I think we can do that in future issue(s).

                
> Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc
> -------------------------------------------------------------------------
>
>                 Key: TIKA-948
>                 URL: https://issues.apache.org/jira/browse/TIKA-948
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: EmbeddedPDF.doc, TIKA-948.patch, TIKA-948.patch
>
>
> This is just like TIKA-704, except that issue was for an OOXML Word
> doc but this is for the older Word 97-2003 format.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-948) Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13408142#comment-13408142 ] 

Nick Burch commented on TIKA-948:
---------------------------------

Should we not just pass the bytes to a Detector if we have a CompObj but nothing else looks promising?
                
> Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc
> -------------------------------------------------------------------------
>
>                 Key: TIKA-948
>                 URL: https://issues.apache.org/jira/browse/TIKA-948
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: EmbeddedPDF.doc, TIKA-948.patch, TIKA-948.patch
>
>
> This is just like TIKA-704, except that issue was for an OOXML Word
> doc but this is for the older Word 97-2003 format.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-948) Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13408436#comment-13408436 ] 

Nick Burch commented on TIKA-948:
---------------------------------

I think r1358467 should fix the file extension problem in a general way for all (known) file formats. The only thing is we don't currently make use of things like "Adobe Acrobat" in the CompObj to short-cut the detection, it's done in a general way for all formats known or unknown. Not sure if we need to fix this or not though
                
> Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc
> -------------------------------------------------------------------------
>
>                 Key: TIKA-948
>                 URL: https://issues.apache.org/jira/browse/TIKA-948
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: EmbeddedPDF.doc, TIKA-948.patch, TIKA-948.patch
>
>
> This is just like TIKA-704, except that issue was for an OOXML Word
> doc but this is for the older Word 97-2003 format.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (TIKA-948) Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-948:
------------------------------------

    Attachment: TIKA-948.patch
                EmbeddedPDF.doc

Here's a trivial test document + test case showing the issue; if you run TikaCLI
-z on this you'll get an embedded file extracted as _1402837031.wps,
but that really should be a PDF.

I traced this down a bit, into AbstractPOIFSExtractor, where it calls
POIFSDocumentType.detectType(dir) and that (incorrectly) returns WPS.

I think the logic in POIFSContainerDetector.detect (which guesses the
embedded file's type by looking at the directory listing of the
document node) is too simplistic?  We may need to peek into the
\0001CompObj contents to get the true document type (I can see, using
POI's POIFSViewer that this seems to identify the MediaType of the
file, and processStarDrawOrImpress already does so...).

But I don't know the format of the bytes in \0001CompObj.

Or maybe alternatively ... we can pull the CONTENTS bytes and
auto-detect on that.  Basically we somehow need to determine if it's
another office format (and do what we now do) else pull the CONTENTS
bytes and recurse on only that.

                
> Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc
> -------------------------------------------------------------------------
>
>                 Key: TIKA-948
>                 URL: https://issues.apache.org/jira/browse/TIKA-948
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: EmbeddedPDF.doc, TIKA-948.patch
>
>
> This is just like TIKA-704, except that issue was for an OOXML Word
> doc but this is for the older Word 97-2003 format.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-948) Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13408300#comment-13408300 ] 

Nick Burch commented on TIKA-948:
---------------------------------

I've had a go at fixing this in r1358404, using a slightly different method. Michael's suggested fix didn't seem to scale well to additional formats, and would've been a bit confusing for the extractors.

Instead, I've updated POIFSContainerDetector to report that a CompObj that isn't a known "native CompObj" has been found (older Works file, some older Star files etc are all natively stored). This is reported in a similar way to the OLE10Native stuff, which was slightly tweaked. I've then updated the AbstarctPOIFSExtractor to then pull the data out of the CompObj structure. Tika-App -z is then able to correctly extract the pdf out.

One remaining issue is with file extensions, a little bit more work is needed to do that cleanly I think
                
> Embedded PDF extracted incorrectly as MS Works file from Word 97-2003 doc
> -------------------------------------------------------------------------
>
>                 Key: TIKA-948
>                 URL: https://issues.apache.org/jira/browse/TIKA-948
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: EmbeddedPDF.doc, TIKA-948.patch, TIKA-948.patch
>
>
> This is just like TIKA-704, except that issue was for an OOXML Word
> doc but this is for the older Word 97-2003 format.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira