Posted to dev@tika.apache.org by "Ray Gauss II (JIRA)" <ji...@apache.org> on 2014/05/12 21:01:19 UTC

[jira] [Commented] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs

    [ https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995474#comment-13995474 ] 

Ray Gauss II commented on TIKA-1294:
------------------------------------

We ran into this exact issue recently, and there is another way to achieve the same result without changing Tika code.

In {{ParsingEmbeddedDocumentExtractor.shouldParseEmbedded}} the {{ParseContext}} is checked for a {{DocumentSelector}}.

That extractor seems to be the only place where that type is looked up (perhaps {{EmbeddedDocumentSelector}} would be a more appropriate name?), so you can create a selector that suits your needs and set it as the document selector value in the {{ParseContext}}.

In our case we created a simple {{MediaTypeDisablingDocumentSelector}} that holds a list of {{disabledMediaTypes}}.
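
A minimal sketch of what such a selector might look like follows. It is not the actual class mentioned above, only an illustration under stated assumptions: that the parser sets {{Content-Type}} on the embedded document's metadata before the selector is consulted, and that matching on the base media type (ignoring parameters and case) is good enough. So that the sketch compiles without Tika on the classpath, {{Metadata}} and {{DocumentSelector}} below are small stand-ins for {{org.apache.tika.metadata.Metadata}} and {{org.apache.tika.extractor.DocumentSelector}}; in real use you would delete them and import the Tika types instead.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Stand-in for org.apache.tika.metadata.Metadata (only what this sketch needs).
class Metadata {
    static final String CONTENT_TYPE = "Content-Type";
    private final Map<String, String> values = new HashMap<>();

    void set(String name, String value) { values.put(name, value); }

    String get(String name) { return values.get(name); }
}

// Stand-in for org.apache.tika.extractor.DocumentSelector.
interface DocumentSelector {
    boolean select(Metadata metadata);
}

// Hypothetical reconstruction: rejects embedded documents whose media type
// is in the disabled set and lets everything else through.
class MediaTypeDisablingDocumentSelector implements DocumentSelector {
    private final Set<String> disabledMediaTypes = new HashSet<>();

    MediaTypeDisablingDocumentSelector(Collection<String> disabledMediaTypes) {
        for (String type : disabledMediaTypes) {
            this.disabledMediaTypes.add(normalize(type));
        }
    }

    @Override
    public boolean select(Metadata metadata) {
        String type = metadata.get(Metadata.CONTENT_TYPE);
        if (type == null) {
            return true; // assumption: unknown types are still parsed
        }
        return !disabledMediaTypes.contains(normalize(type));
    }

    // Drops parameters such as "; charset=..." and lower-cases the base type.
    private static String normalize(String mediaType) {
        int semicolon = mediaType.indexOf(';');
        String base = semicolon >= 0 ? mediaType.substring(0, semicolon) : mediaType;
        return base.trim().toLowerCase(Locale.ROOT);
    }
}
```

With the real Tika types, registering it would look something like {{parseContext.set(DocumentSelector.class, new MediaTypeDisablingDocumentSelector(types))}}, where {{parseContext}} and {{types}} are placeholder names for your {{ParseContext}} instance and your list of media types to skip.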

See [{{TikaGUI}}|http://svn.apache.org/repos/asf/tika/trunk/tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java] and its {{ImageDocumentSelector}} as a general example of document selector use.

> Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-1294
>                 URL: https://issues.apache.org/jira/browse/TIKA-1294
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Trivial
>         Attachments: TIKA-1294.patch
>
>
> TIKA-1268 added the capability to extract embedded images as regular embedded resources...a great feature!
> However, for some use cases, it might not be desirable to extract those types of embedded resources.  I see two ways of allowing the client to choose whether or not to extract those images:
> 1) set a value in the metadata for the extracted images that identifies them as embedded PDXObjectImages vs regular image attachments.  The client can then choose not to process embedded resources with a given metadata value.
> 2) allow the client to set a parameter in the PDFConfig object.
> My initial proposal is to go with option 2, and I'll attach a patch shortly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)