Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2021/03/22 14:31:00 UTC

[jira] [Comment Edited] (TIKA-3332) Embedded files not extracted from PDF files with multilevel EmbeddedFiles tree

    [ https://issues.apache.org/jira/browse/TIKA-3332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17306245#comment-17306245 ] 

Tim Allison edited comment on TIKA-3332 at 3/22/21, 2:30 PM:
-------------------------------------------------------------

Does this one work as an example? 

[https://corpora.tika.apache.org/base/docs/bug_trackers/pdfcpu/pdfcpu-201-0.zip-0.pdf]

This file derives from: https://github.com/pdfcpu/pdfcpu/issues/201


was (Author: tallison@mitre.org):
Does this one work as an example? 

https://corpora.tika.apache.org/base/docs/bug_trackers/pdfcpu/pdfcpu-201-0.zip-0.pdf

> Embedded files not extracted from PDF files with multilevel EmbeddedFiles tree
> ------------------------------------------------------------------------------
>
>                 Key: TIKA-3332
>                 URL: https://issues.apache.org/jira/browse/TIKA-3332
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.25
>            Reporter: Ross Johnson
>            Priority: Major
>         Attachments: Screen Shot 2021-03-22 at 10.29.51 AM.png, image-2021-03-20-13-36-48-525.png
>
>
> I have come across some portfolio PDFs that contain many attachments (embedded files), but Tika does not detect or extract them as it does for some other portfolio PDFs. The cause may be that these files have a multilevel EmbeddedFiles name tree that PDFBox does not handle properly.
> Here is the EmbeddedFiles structure of one of the PDF portfolios in question. Note that the root EmbeddedFiles dictionary has a Kids array consisting only of intermediate dictionaries, with the actual Names array one more level down.
> !image-2021-03-20-13-36-48-525.png!
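
To illustrate the structure described in the report, here is a minimal, self-contained sketch of walking a name tree whose root holds only Kids, with the Names entries sitting one or more levels further down. The `Node` class and `collectNames` method are hypothetical stand-ins invented for this example; they are not PDFBox or Tika classes, and this is not the fix applied in either project. The point is only that a traversal which reads Names from the root and its direct children alone would miss the entries in this layout, while a recursive walk finds them at any depth.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class NameTreeSketch {

    /** Hypothetical model of a name tree node: it may hold Names, Kids, or both. */
    static final class Node {
        final Map<String, String> names = new LinkedHashMap<>();
        final List<Node> kids = new ArrayList<>();

        Node name(String key, String value) { names.put(key, value); return this; }
        Node kid(Node child) { kids.add(child); return this; }
    }

    /** Collects every name entry reachable from the given node, at any depth. */
    static Map<String, String> collectNames(Node root) {
        Map<String, String> out = new LinkedHashMap<>();
        // Identity-based visited set guards against malformed trees containing cycles.
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        collect(root, out, seen);
        return out;
    }

    private static void collect(Node node, Map<String, String> out, Set<Node> seen) {
        if (node == null || !seen.add(node)) {
            return;
        }
        out.putAll(node.names);          // entries stored directly on this node, if any
        for (Node kid : node.kids) {     // descend through intermediate nodes
            collect(kid, out, seen);
        }
    }

    public static void main(String[] args) {
        // Root -> intermediate nodes -> leaves holding the actual names.
        Node leafA = new Node().name("a.txt", "file A").name("b.txt", "file B");
        Node leafB = new Node().name("c.txt", "file C");
        Node root = new Node()
                .kid(new Node().kid(leafA))
                .kid(new Node().kid(leafB));

        System.out.println(collectNames(root).keySet()); // prints [a.txt, b.txt, c.txt]
    }
}
```

In this sketch the root's own `names` map is empty, so any code that stops after one level of Kids would report no embedded files for this tree.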



--
This message was sent by Atlassian Jira
(v8.3.4#803005)