Posted to dev@pdfbox.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2012/05/05 12:41:49 UTC
[jira] [Created] (PDFBOX-1303) Tika's PDFParser fails to parse
documents embedded in a PDF Package
Michael McCandless created PDFBOX-1303:
------------------------------------------
Summary: Tika's PDFParser fails to parse documents embedded in a PDF Package
Key: PDFBOX-1303
URL: https://issues.apache.org/jira/browse/PDFBOX-1303
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Reporter: Michael McCandless
Fix For: 1.7.0
In working on PDFBOX-1297, I realized Tika's PDFParser also doesn't
visit documents embedded within a PDF document (i.e. a PDF package).
Tika can actually handle this better than ExtractText since it can
recurse on any embedded document type (not just PDFs) and parse them
as well, vs ExtractText which only extracts when the embedded
documents are also PDF.
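
To make the difference described above concrete, here is a minimal sketch in plain Java. It does not use the PDFBox or Tika APIs: the `Doc` class, the media type strings, and both method names are hypothetical stand-ins chosen for illustration. It only contrasts a PDF-only walk with a walk that recurses into every embedded file regardless of type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EmbeddedRecursionSketch {

    /** Hypothetical stand-in for a document with embedded files; not a PDFBox or Tika type. */
    static final class Doc {
        final String name;
        final String mediaType;
        final String text;
        final List<Doc> embedded;

        Doc(String name, String mediaType, String text, Doc... embedded) {
            this.name = name;
            this.mediaType = mediaType;
            this.text = text;
            this.embedded = Arrays.asList(embedded);
        }
    }

    /** PDF-only behaviour: embedded files of any other type are skipped. */
    static void extractPdfOnly(Doc doc, List<String> out) {
        out.add(doc.text);
        for (Doc child : doc.embedded) {
            if ("application/pdf".equals(child.mediaType)) {
                extractPdfOnly(child, out);
            }
        }
    }

    /** Type-agnostic behaviour: every embedded file is visited, whatever its type. */
    static void extractAll(Doc doc, List<String> out) {
        out.add(doc.text);
        for (Doc child : doc.embedded) {
            extractAll(child, out);
        }
    }

    public static void main(String[] args) {
        // A package with one embedded PDF and one embedded plain-text file.
        Doc pkg = new Doc("package.pdf", "application/pdf", "cover",
                new Doc("a.pdf", "application/pdf", "pdf text"),
                new Doc("b.txt", "text/plain", "plain text"));

        List<String> pdfOnly = new ArrayList<>();
        extractPdfOnly(pkg, pdfOnly);

        List<String> all = new ArrayList<>();
        extractAll(pkg, all);

        System.out.println(pdfOnly);
        System.out.println(all);
    }
}
```

In this toy model the PDF-only walk never reaches `b.txt`, while the type-agnostic walk collects text from all three documents.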
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (PDFBOX-1303) Tika's PDFParser fails to parse
documents embedded in a PDF Package
Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler resolved PDFBOX-1303.
----------------------------------------
Resolution: Fixed
Assignee: Andreas Lehmkühler
I applied the patch in revision 1339254 as proposed. I changed the test case from PDFBOX-1297 so that both can share the input PDF.
Thanks for the contribution!
[jira] [Updated] (PDFBOX-1303) Tika's PDFParser fails to parse
documents embedded in a PDF Package
Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated PDFBOX-1303:
---------------------------------------
Attachment: testPDFPackage.pdf
PDFBOX-1303.patch
Patch with test case.
The code to visit the embedded documents is basically the same as in PDFBOX-1297, except that I invoke Tika's EmbeddedDocumentExtractor (defaulting to ParsingEmbeddedDocumentExtractor) for each...
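
The shape of that delegation can be sketched roughly as follows. This is not the attached patch and it does not call the real Tika or PDFBox classes: the `Extractor` interface, `CollectingExtractor`, and `visitEmbedded` are hypothetical names, and the embedded files are modelled as a simple name-to-bytes map. The only point illustrated is handing each embedded file to a configurable extractor, and falling back to a default one when none is supplied.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EmbeddedExtractorSketch {

    /** Hypothetical callback; it only mimics the role an embedded-document extractor plays. */
    interface Extractor {
        boolean shouldParse(String name);
        void parse(String name, byte[] content);
    }

    /** Hypothetical default: accepts every embedded file and records its decoded content. */
    static final class CollectingExtractor implements Extractor {
        final List<String> parsed = new ArrayList<>();

        @Override
        public boolean shouldParse(String name) {
            return true;
        }

        @Override
        public void parse(String name, byte[] content) {
            parsed.add(name + "=" + new String(content, StandardCharsets.UTF_8));
        }
    }

    /**
     * Hands each embedded file to the configured extractor, or to a default
     * CollectingExtractor when none is configured. Returns the extractor used.
     */
    static Extractor visitEmbedded(Map<String, byte[]> files, Extractor configured) {
        Extractor extractor = (configured != null) ? configured : new CollectingExtractor();
        for (Map.Entry<String, byte[]> entry : files.entrySet()) {
            if (extractor.shouldParse(entry.getKey())) {
                extractor.parse(entry.getKey(), entry.getValue());
            }
        }
        return extractor;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so files are visited in the order added.
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("a.pdf", "first".getBytes(StandardCharsets.UTF_8));
        files.put("b.txt", "second".getBytes(StandardCharsets.UTF_8));

        // Passing null selects the default extractor.
        CollectingExtractor used = (CollectingExtractor) visitEmbedded(files, null);
        System.out.println(used.parsed);
    }
}
```

Keeping the extractor behind an interface means the caller decides what happens to each embedded file (parse it, skip it, store it) without the visiting loop changing.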