Posted to dev@pdfbox.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2012/05/05 12:41:49 UTC

[jira] [Created] (PDFBOX-1303) Tika's PDFParser fails to parse documents embedded in a PDF Package

Michael McCandless created PDFBOX-1303:
------------------------------------------

             Summary: Tika's PDFParser fails to parse documents embedded in a PDF Package
                 Key: PDFBOX-1303
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1303
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
            Reporter: Michael McCandless
             Fix For: 1.7.0


While working on PDFBOX-1297, I realized Tika's PDFParser also doesn't
visit documents embedded within a PDF document (i.e. a PDF package).

Tika can actually handle this better than ExtractText, since it can
recurse on any embedded document type (not just PDFs) and parse them
as well, whereas ExtractText only extracts when the embedded
documents are also PDFs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Resolved] (PDFBOX-1303) Tika's PDFParser fails to parse documents embedded in a PDF Package

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-1303.
----------------------------------------

    Resolution: Fixed
      Assignee: Andreas Lehmkühler

I applied the patch in revision 1339254 as proposed. I changed the test case from PDFBOX-1297 so that both can share the input PDF.

Thanks for the contribution!
                


       

[jira] [Updated] (PDFBOX-1303) Tika's PDFParser fails to parse documents embedded in a PDF Package

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated PDFBOX-1303:
---------------------------------------

    Attachment: testPDFPackage.pdf
                PDFBOX-1303.patch

Patch w/ test case.

The code to visit the embedded documents is basically the same as PDFBOX-1297, except I invoke Tika's EmbeddedDocumentExtractor (defaulting to ParsingEmbeddedDocumentExtractor) for each...
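
To illustrate the difference described above, here is a rough, self-contained sketch of the delegation pattern. It is not the attached patch and it does not use the real PDFBox or Tika classes: Doc, Extractor, PdfOnlyExtractor and AnyTypeExtractor are hypothetical stand-ins invented for this example, and the file names are made up. The only point is that a visitor which hands each embedded file to a pluggable extractor can recurse on any type, while a PDF-only visitor stops at non-PDF attachments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EmbeddedVisitSketch {

    /** Hypothetical stand-in for a document that may carry embedded files. */
    static final class Doc {
        final String name;
        final String mediaType;
        final List<Doc> embedded;

        Doc(String name, String mediaType, Doc... embedded) {
            this.name = name;
            this.mediaType = mediaType;
            this.embedded = Arrays.asList(embedded);
        }
    }

    /** Hypothetical callback; NOT Tika's actual EmbeddedDocumentExtractor interface. */
    interface Extractor {
        boolean shouldParse(Doc doc);
        void parse(Doc doc, List<String> visited);
    }

    /** Descends only into embedded PDFs. */
    static final class PdfOnlyExtractor implements Extractor {
        public boolean shouldParse(Doc doc) {
            return "application/pdf".equals(doc.mediaType);
        }
        public void parse(Doc doc, List<String> visited) {
            visited.add(doc.name);
            visitEmbedded(doc, this, visited);
        }
    }

    /** Accepts every media type and keeps recursing. */
    static final class AnyTypeExtractor implements Extractor {
        public boolean shouldParse(Doc doc) {
            return true;
        }
        public void parse(Doc doc, List<String> visited) {
            visited.add(doc.name);
            visitEmbedded(doc, this, visited);
        }
    }

    /** Offers each file embedded in the container to the extractor. */
    static void visitEmbedded(Doc container, Extractor extractor, List<String> visited) {
        for (Doc child : container.embedded) {
            if (extractor.shouldParse(child)) {
                extractor.parse(child, visited);
            }
        }
    }

    /** Returns the names of the embedded documents the extractor parsed, in visit order. */
    static List<String> visit(Doc root, Extractor extractor) {
        List<String> visited = new ArrayList<>();
        visitEmbedded(root, extractor, visited);
        return visited;
    }

    /** Made-up package: a PDF holding another PDF (with a text file inside) and a Word file. */
    static Doc samplePackage() {
        return new Doc("package.pdf", "application/pdf",
                new Doc("inner.pdf", "application/pdf",
                        new Doc("notes.txt", "text/plain")),
                new Doc("report.doc", "application/msword"));
    }

    public static void main(String[] args) {
        Doc pkg = samplePackage();
        System.out.println(visit(pkg, new PdfOnlyExtractor())); // [inner.pdf]
        System.out.println(visit(pkg, new AnyTypeExtractor())); // [inner.pdf, notes.txt, report.doc]
    }
}
```

In this toy model the PDF-only visitor never reaches notes.txt or report.doc, which is the gap the issue description points out for ExtractText.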
                
