Posted to dev@tika.apache.org by "Ross Johnson (Jira)" <ji...@apache.org> on 2021/03/20 17:42:00 UTC
[jira] [Created] (TIKA-3332) Embedded files not extracted from PDF files with multilevel EmbeddedFiles tree
Ross Johnson created TIKA-3332:
----------------------------------
Summary: Embedded files not extracted from PDF files with multilevel EmbeddedFiles tree
Key: TIKA-3332
URL: https://issues.apache.org/jira/browse/TIKA-3332
Project: Tika
Issue Type: Bug
Affects Versions: 1.25
Reporter: Ross Johnson
Attachments: image-2021-03-20-13-36-48-525.png
I have come across some portfolio PDFs that contain many attachments (embedded files), but Tika neither detects nor extracts them, although it does so for other portfolio PDFs. The cause may be that these files use a multilevel EmbeddedFiles name tree that PDFBox does not handle properly.
Here is the EmbeddedFiles structure of one of the affected PDF portfolios. Note that the root EmbeddedFiles dictionary has a Kids array consisting only of intermediate dictionaries, with the actual Names arrays one level further down.
!image-2021-03-20-13-36-48-525.png!
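To make the structure described above concrete, here is a minimal sketch of what collecting entries from a multilevel name tree involves. It is only an illustration under stated assumptions: NameTreeSketch, Node, and collect are hypothetical names invented for this example, the node type is a simplified stand-in rather than the PDFBox or Tika classes, and nothing here is a claim about where the actual bug lies. The point is that the walk has to recurse through every Kids level instead of stopping at the root's direct children.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for a PDF name tree; NOT the PDFBox or Tika classes.
class NameTreeSketch {

    // Hypothetical node: it may carry leaf entries (Names), child nodes (Kids), or both.
    static final class Node {
        final Map<String, String> names; // null when the node has no Names array
        final List<Node> kids;           // null when the node has no Kids array

        Node(Map<String, String> names, List<Node> kids) {
            this.names = names;
            this.kids = kids;
        }
    }

    // Depth-first walk that gathers entries from every level of the tree.
    static Map<String, String> collect(Node root) {
        Map<String, String> out = new LinkedHashMap<>();
        collectInto(root, out);
        return out;
    }

    private static void collectInto(Node node, Map<String, String> out) {
        if (node == null) {
            return;
        }
        if (node.names != null) {
            out.putAll(node.names);
        }
        if (node.kids != null) {
            for (Node kid : node.kids) {
                collectInto(kid, out); // recurse: kids may themselves be intermediate nodes
            }
        }
    }

    public static void main(String[] args) {
        // Root -> intermediate -> leaves, mirroring the layout in the screenshot.
        Node leafA = new Node(Map.of("a.txt", "contents of a"), null);
        Node leafB = new Node(Map.of("b.txt", "contents of b"), null);
        Node intermediate = new Node(null, List.of(leafA, leafB));
        Node root = new Node(null, List.of(intermediate));
        System.out.println(collect(root).keySet());
    }
}
```

A walk that only reads the Names of the root's immediate kids would find nothing in this layout, because the intermediate node has Kids but no Names. The sketch does not guard against cyclic references, which a real parser handling untrusted PDFs would need to do.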