Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2014/07/24 15:15:38 UTC

[jira] [Updated] (TIKA-1374) Need to add code to look for OS-specific keys for embedded files within PDFs

     [ https://issues.apache.org/jira/browse/TIKA-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1374:
------------------------------

    Description: 
Embedded files in PDFs can be found under the general, all-purpose key that we currently use via PDFBox: "F".  However, embedded documents can also be stored under OS-specific keys: "DOS", "Mac" and "Unix".

[~lehmi] confirmed on the PDFBox users [list|http://mail-archives.apache.org/mod_mbox/pdfbox-users/201407.mbox/%3c1572548479.2099779.1406198761475.open-xchange@patina.store%3e] that we might be missing embedded documents if we're not trying the OS-specific keys as well.  As Andreas points out, according to the spec the OS-specific keys shouldn't be used any more, but I think we should support extraction for them.

My proposal is to pull all documents that are available under any of the four keys (via the getEmbeddedFile<OS>() methods in PDFBox).  This has the downside of potentially extracting near-duplicate documents, but I'd prefer to err on the side of extracting everything.
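
As a rough illustration of the proposal above, here is a minimal sketch of the "try every key" logic.  It does not use PDFBox: the file specification's embedded-file entries are modeled as a plain Map from key name to bytes, and the class name EmbeddedFileKeys and method collectAll are hypothetical names made up for this example.  In the real parser, each lookup would presumably be the matching getEmbeddedFile<OS>() call.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Tika or PDFBox.
final class EmbeddedFileKeys {

    // The general key first, then the three OS-specific keys named in the issue.
    static final List<String> KEYS = Arrays.asList("F", "DOS", "Mac", "Unix");

    // Assumption: efDictionary stands in for the embedded-file entries of one
    // PDF file specification, mapping key name -> embedded file bytes.
    static Map<String, byte[]> collectAll(Map<String, byte[]> efDictionary) {
        Map<String, byte[]> found = new LinkedHashMap<>();
        for (String key : KEYS) {
            byte[] content = efDictionary.get(key);
            if (content != null) {
                // Deliberately no de-duplication: the same bytes stored under
                // two keys are returned twice, erring on the side of
                // extracting everything.
                found.put(key, content);
            }
        }
        return found;
    }
}
```

The result preserves the order of KEYS, so a file stored under "F" is always handled before any OS-specific copy.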

The code fix is trivial, and I'll try to commit it today.  However, it will take me some time to generate a test file that stores files under the OS-specific keys.  So, if anyone has an ASF-friendly file available or wants to take on the task of generating one, please do.

  was:
Embedded files in PDFs can be found by the general all purpose key we  currently use via PDFBox:  "EF/F".  However, embedded documents can also be stored under OS specific keys: "EF/DOS", "EF/Mac" and "EF/Unix".

[~lehmi] confirmed on the PDFBox users [list|http://mail-archives.apache.org/mod_mbox/pdfbox-users/201407.mbox/%3c1572548479.2099779.1406198761475.open-xchange@patina.store%3e] that we might be missing embedded documents if we're not trying the OS specific keys as well.  As Andreas points out, according to the spec the OS specific keys shouldn't be used any more, but I think we should support extraction for them.

My proposal is to pull all documents that are available by any of the four keys (well, via getEmbeddedFile<OS>() in PDFBox).  The code fix is trivial, and I'll try to commit it today.  However, it will take me a bit of time to generate a test file that stores files under the OS specific keys.  So, if anyone has an ASF-friendly file available or wants to take the task of generating one, please do.


> Need to add code to look for OS-specific keys for embedded files within PDFs
> ----------------------------------------------------------------------------
>
>                 Key: TIKA-1374
>                 URL: https://issues.apache.org/jira/browse/TIKA-1374
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 1.6
>
>
> Embedded files in PDFs can be found under the general, all-purpose key that we currently use via PDFBox: "F".  However, embedded documents can also be stored under OS-specific keys: "DOS", "Mac" and "Unix".
> [~lehmi] confirmed on the PDFBox users [list|http://mail-archives.apache.org/mod_mbox/pdfbox-users/201407.mbox/%3c1572548479.2099779.1406198761475.open-xchange@patina.store%3e] that we might be missing embedded documents if we're not trying the OS-specific keys as well.  As Andreas points out, according to the spec the OS-specific keys shouldn't be used any more, but I think we should support extraction for them.
> My proposal is to pull all documents that are available under any of the four keys (via the getEmbeddedFile<OS>() methods in PDFBox).  This has the downside of potentially extracting near-duplicate documents, but I'd prefer to err on the side of extracting everything.
> The code fix is trivial, and I'll try to commit it today.  However, it will take me some time to generate a test file that stores files under the OS-specific keys.  So, if anyone has an ASF-friendly file available or wants to take on the task of generating one, please do.



--
This message was sent by Atlassian JIRA
(v6.2#6252)