Posted to dev@tika.apache.org by "Jeremy Anderson (JIRA)" <ji...@apache.org> on 2011/09/01 19:34:11 UTC

[jira] [Created] (TIKA-704) PDF and Outlook docs embedded in MS Word documents not parsed

PDF and Outlook docs embedded in MS Word documents not parsed
-------------------------------------------------------------

                 Key: TIKA-704
                 URL: https://issues.apache.org/jira/browse/TIKA-704
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.9
         Environment: Windows 7 64-bit
            Reporter: Jeremy Anderson


Currently there appear to be issues with embedded PDFs and Outlook MSG files contained in MS Word documents. I'll attach a sample for each, along with my recursive parser (in case the problem lies in there).

From what I see, when these embedded objects are parsed, they're initially identified as vnd.openxmlformats-officedocument.oleObject in the metadata's Content-Type field. After the RecursiveParser calls its superclass's parse method, the Content-Types change to the following:

PDFs: application/vnd.ms-works
.MSG: application/x-tika-msoffice

The internal AutoDetectParser is unable to properly identify these PDFs and therefore does not call the PDFParser on them.
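
One possible reading of the symptom, sketched in plain Java: if the embedded file is stored inside an OLE2 wrapper, a detector that only looks at the leading bytes sees the wrapper's signature rather than the payload's. This is a simplified assumption for illustration, not Tika's actual detection code. The class name MagicSketch, the methods detect and wrapInOle2, and the fake 512-byte wrapper are all hypothetical. It also does not explain why the PDFs end up as application/vnd.ms-works specifically.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is not Tika's detection code.
final class MagicSketch {
    // Signature that opens every OLE2 compound document (.doc, .msg, ...).
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, (byte) 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, (byte) 0x1A, (byte) 0xE1
    };

    // Signature that opens a PDF file.
    static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // True if data begins with exactly the bytes in prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Naive detection that looks only at the leading bytes.
    static String detect(byte[] data) {
        if (startsWith(data, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(data, OLE2_MAGIC)) {
            // Generic OLE2 type, as reported for the embedded .msg above.
            return "application/x-tika-msoffice";
        }
        return "application/octet-stream";
    }

    // Crude stand-in for an OLE wrapper (assumption): a 512-byte block
    // starting with the OLE2 signature, followed by the payload.
    // A real compound document has a far richer structure.
    static byte[] wrapInOle2(byte[] payload) {
        byte[] out = new byte[512 + payload.length];
        System.arraycopy(OLE2_MAGIC, 0, out, 0, OLE2_MAGIC.length);
        System.arraycopy(payload, 0, out, 512, payload.length);
        return out;
    }
}
```

With this sketch, detect returns application/pdf for bytes that start with %PDF-, but returns application/x-tika-msoffice for the same bytes after wrapInOle2, because the PDF signature no longer sits at offset 0. Under that assumption, the parser would need to unwrap the OLE object before detection in order to reach the PDFParser.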



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (TIKA-704) PDF and Outlook docs embedded in MS Word documents not parsed

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101077#comment-13101077 ] 

Jukka Zitting commented on TIKA-704:
------------------------------------

Thanks! I added the test cases in revision 1167052.


[jira] [Commented] (TIKA-704) PDF and Outlook docs embedded in MS Word documents not parsed

Posted by "Jeremy Anderson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098920#comment-13098920 ] 

Jeremy Anderson commented on TIKA-704:
--------------------------------------

Thanks for the fast attention and the fix. I didn't realize that these types of test files could be, or were desired to be, included... I'll change the license on those already uploaded (MSG example) and/or re-upload (a better PDF example).

Thanks again!!!


[jira] [Resolved] (TIKA-704) PDF and Outlook docs embedded in MS Word documents not parsed

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-704.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0
         Assignee: Jukka Zitting

Thanks for bringing this up! Fixed in revision 1164578 with better handling of embedded OLE objects.


[jira] [Commented] (TIKA-704) PDF and Outlook docs embedded in MS Word documents not parsed

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097212#comment-13097212 ] 

Jukka Zitting commented on TIKA-704:
------------------------------------

See also revisions 1165230 and 1165259 for followup work.

Jeremy, I notice you didn't select the "Grant license to ASF for inclusion in ASF works (as per the Apache License §5)" option when uploading the test documents. Was this on purpose, or is it OK if we include at least the testWithPdf.docx document as a test case in Tika?

The email attachment in testWithOutlook.docx contains a PDF under Yamaha copyright, so we probably can't use that in any case. It would be great if you could create an alternative test file without any external content that we could include in Tika.


[jira] [Updated] (TIKA-704) PDF and Outlook docs embedded in MS Word documents not parsed

Posted by "Jeremy Anderson (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremy Anderson updated TIKA-704:
---------------------------------

    Attachment: LicensedTestWithPdf.docx
                LicensedTestWithOutlook.docx

These are licensed versions... No Yamaha manual this time.


[jira] [Commented] (TIKA-704) PDF and Outlook docs embedded in MS Word documents not parsed

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101087#comment-13101087 ] 

Jukka Zitting commented on TIKA-704:
------------------------------------

Hmm, there was still a hidden copy of the Yamaha manual in the test file. I removed that in revision 1167056, which also brought down the size of the file from 3.9MB to a more comfortable 100KB.


[jira] [Updated] (TIKA-704) PDF and Outlook docs embedded in MS Word documents not parsed

Posted by "Jeremy Anderson (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremy Anderson updated TIKA-704:
---------------------------------

    Attachment: TestWithPdf.docx
                TestWithOutlook.docx
                recursiveUsage.txt

Sample files:

• RecursiveParser being used.
• Word document with an embedded .msg.
• Word document with some embedded PDFs.

