Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2014/05/12 21:17:17 UTC
[jira] [Comment Edited] (TIKA-1294) Add ability to turn off
extraction of PDXObjectImages (TIKA-1268) from PDFs
[ https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995491#comment-13995491 ]
Tim Allison edited comment on TIKA-1294 at 5/12/14 7:16 PM:
------------------------------------------------------------
Great. Just to make sure that I understand correctly: I think I was heading down this route at one point by subclassing EmbeddedResourceHandler. Can your MediaTypeDisablingDocumentSelector tell the difference between a JPEG that was attached to a PDF (a basic attachment) and one that was derived from a PDXObjectImage?
was (Author: tallison@mitre.org):
Great. Just to make sure that I understand correctly: I think I was heading down this route at one point. Can your MediaTypeDisablingDocumentSelector tell the difference between a JPEG that was attached to a PDF (a basic attachment) and one that was derived from a PDXObjectImage?
> Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
> ---------------------------------------------------------------------------
>
> Key: TIKA-1294
> URL: https://issues.apache.org/jira/browse/TIKA-1294
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Trivial
> Attachments: TIKA-1294.patch
>
>
> TIKA-1268 added the capability to extract embedded images as regular embedded resources, a great feature!
> However, for some use cases it might not be desirable to extract those embedded images. I see two ways of letting the client choose whether to extract them:
> 1) Set a value in the metadata of each extracted image that identifies it as an embedded PDXObjectImage rather than a regular image attachment. The client can then choose not to process embedded resources with a given metadata value.
> 2) Allow the client to set a parameter in the PDFConfig object.
> My initial proposal is to go with option 2, and I'll attach a patch shortly.
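
To make the two options concrete, here is a minimal, self-contained sketch in plain Java. It does not use the Tika API: the metadata key `embeddedResourceType`, its values, the `PdfConfigSketch` class, and the `extractInlineImages` flag are all hypothetical names chosen for illustration, and the attached patch may look quite different.

```java
import java.util.Map;
import java.util.function.Predicate;

// Illustrative sketch only; none of these names come from Tika itself.
class InlineImageFilterSketch {

    // Hypothetical metadata key and values marking where an embedded resource came from.
    static final String RESOURCE_TYPE_KEY = "embeddedResourceType";
    static final String INLINE_IMAGE = "INLINE_IMAGE";   // derived from a PDXObjectImage
    static final String ATTACHMENT = "ATTACHMENT";       // a regular file attachment

    // Option 1: the parser tags each resource, and the client filters on the tag.
    static Predicate<Map<String, String>> skipInlineImages() {
        return metadata -> !INLINE_IMAGE.equals(metadata.get(RESOURCE_TYPE_KEY));
    }

    // Option 2: the client sets a flag on a config object, and the parser
    // consults it before handing an inline image to the embedded-resource handler.
    static final class PdfConfigSketch {
        private boolean extractInlineImages = true; // assumed default: keep current behavior

        boolean isExtractInlineImages() {
            return extractInlineImages;
        }

        void setExtractInlineImages(boolean extractInlineImages) {
            this.extractInlineImages = extractInlineImages;
        }
    }

    // Parser-side decision for option 2: attachments are always extracted,
    // inline images only when the flag is on.
    static boolean shouldExtract(PdfConfigSketch config, Map<String, String> metadata) {
        boolean inline = INLINE_IMAGE.equals(metadata.get(RESOURCE_TYPE_KEY));
        return !inline || config.isExtractInlineImages();
    }

    public static void main(String[] args) {
        Map<String, String> inlineJpeg = Map.of(RESOURCE_TYPE_KEY, INLINE_IMAGE);
        Map<String, String> attachedJpeg = Map.of(RESOURCE_TYPE_KEY, ATTACHMENT);

        // Option 1: client-side selection based on metadata.
        System.out.println(skipInlineImages().test(inlineJpeg));
        System.out.println(skipInlineImages().test(attachedJpeg));

        // Option 2: parser-side selection based on configuration.
        PdfConfigSketch config = new PdfConfigSketch();
        config.setExtractInlineImages(false);
        System.out.println(shouldExtract(config, inlineJpeg));
        System.out.println(shouldExtract(config, attachedJpeg));
    }
}
```

In this sketch, option 1 still pays the cost of extracting each image before the client discards it, while option 2 lets the parser skip the work entirely. Note that a selector keyed only on media type could not separate the two JPEGs above, which is why option 1 depends on the extra metadata value.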
--
This message was sent by Atlassian JIRA
(v6.2#6252)