Posted to dev@tika.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2021/03/20 18:57:00 UTC
[jira] [Commented] (TIKA-3332) Embedded files not extracted from
PDF files with multilevel EmbeddedFiles tree
[ https://issues.apache.org/jira/browse/TIKA-3332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17305547#comment-17305547 ]
Tilman Hausherr commented on TIKA-3332:
---------------------------------------
The ExtractEmbeddedFiles.java example code in PDFBox is flawed because it extracts only one level of the EmbeddedFiles name tree. The Tika code acknowledges this limitation in a comment:
{code:java}
//If there is a need we could add a fully recursive search to find a non-null
//Map<String, COSObjectable> that contains the doc info.
{code}
Can you attach one of these files?
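For illustration, a fully recursive search could look roughly like the sketch below. It is a minimal, self-contained model and not the Tika or PDFBox code: NameTreeSketch, Node and collectAll are hypothetical names, and the String values stand in for the embedded file specifications. In PDFBox the corresponding calls would presumably be getNames() and getKids() on the name tree node, either of which can return null.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a PDF name tree; this is not a PDFBox class.
class NameTreeSketch {

    static final class Node {
        final Map<String, String> names; // leaf entries; null on intermediate nodes
        final List<Node> kids;           // child nodes; null on leaf nodes

        Node(Map<String, String> names, List<Node> kids) {
            this.names = names;
            this.kids = kids;
        }
    }

    // Collects the entries found at every level of the tree, depth first.
    static Map<String, String> collectAll(Node root) {
        Map<String, String> out = new LinkedHashMap<>();
        collectInto(root, out);
        return out;
    }

    private static void collectInto(Node node, Map<String, String> out) {
        if (node == null) {
            return;
        }
        if (node.names != null) {
            out.putAll(node.names);
        }
        if (node.kids != null) {
            // Descend regardless of depth, so that intermediate nodes
            // holding only a Kids array are not a dead end.
            for (Node kid : node.kids) {
                collectInto(kid, out);
            }
        }
    }
}
```

The sketch assumes a well-formed tree. A malformed file whose Kids entries point back to an ancestor would make it recurse without end, so real code would want to track visited nodes or cap the depth.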
> Embedded files not extracted from PDF files with multilevel EmbeddedFiles tree
> ------------------------------------------------------------------------------
>
> Key: TIKA-3332
> URL: https://issues.apache.org/jira/browse/TIKA-3332
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.25
> Reporter: Ross Johnson
> Priority: Major
> Attachments: image-2021-03-20-13-36-48-525.png
>
>
> I have come across some portfolio PDFs that have many attachments / embedded files, but Tika is not detecting or extracting them as it does with some other portfolio PDFs. The issue may be that these files have a multilevel EmbeddedFiles name tree that is not being handled properly by PDFBox.
> Here is the EmbeddedFiles structure of one of the PDF portfolios in question. Notice that the root EmbeddedFiles dictionary has a Kids array that consists only of intermediate dictionaries, with the actual Names array one level further down.
> !image-2021-03-20-13-36-48-525.png!