Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2014/05/27 21:39:01 UTC

[jira] [Closed] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs

     [ https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison closed TIKA-1294.
-----------------------------

       Resolution: Fixed
    Fix Version/s: 1.6
         Assignee: Tim Allison

Thank you, [~jukkaz].  Fixed in r1597856.  

I also added a parameter to extract only unique images: if every page uses an image with the same COS id, only the first occurrence is extracted.

In govdocs1 239665, there are roughly 2700 images, but only 56 unique images.
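
To illustrate the unique-images behavior described above, here is a minimal sketch of the idea, not the committed code. ImageRef, its cosId field, and select() are hypothetical names made up for this example; the only assumption carried over from the comment is that repeated uses of an image share one id.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UniqueImageFilter {
    // Hypothetical stand-in for one use of an image on a page. The cosId
    // plays the role of the COS object id shared by every use of that image.
    static final class ImageRef {
        final String cosId;
        final int page;

        ImageRef(String cosId, int page) {
            this.cosId = cosId;
            this.page = page;
        }
    }

    // Returns the image references to extract, in document order. When
    // uniqueOnly is true, a reference whose id was already seen is skipped,
    // so only the first occurrence of each image is kept.
    static List<ImageRef> select(List<ImageRef> refs, boolean uniqueOnly) {
        Set<String> seen = new HashSet<>();
        List<ImageRef> out = new ArrayList<>();
        for (ImageRef ref : refs) {
            // Set.add returns false if the id was already present.
            if (uniqueOnly && !seen.add(ref.cosId)) {
                continue;
            }
            out.add(ref);
        }
        return out;
    }

    public static void main(String[] args) {
        List<ImageRef> refs = new ArrayList<>();
        for (int page = 1; page <= 3; page++) {
            refs.add(new ImageRef("logo", page)); // same image on every page
        }
        refs.add(new ImageRef("chart", 2));

        System.out.println(select(refs, false).size()); // all references
        System.out.println(select(refs, true).size());  // first occurrences only
    }
}
```

With a document like the govdocs1 file mentioned above, this kind of filter is what would reduce roughly 2700 image references to the 56 distinct images.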

[~rgauss], let me know if these mods work for you.

> Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-1294
>                 URL: https://issues.apache.org/jira/browse/TIKA-1294
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Trivial
>             Fix For: 1.6
>
>         Attachments: TIKA-1294.patch, TIKA-1294v1.patch
>
>
> TIKA-1268 added the capability to extract embedded images as regular embedded resources...a great feature!
> However, for some use cases, it might not be desirable to extract those types of embedded resources.  I see two ways of allowing the client to choose whether or not to extract those images:
> 1) Set a value in the metadata for the extracted images that identifies them as embedded PDXObjectImages vs. regular image attachments.  The client can then choose not to process embedded resources with a given metadata value.
> 2) Allow the client to set a parameter in the PDFConfig object.
> My initial proposal is to go with option 2, and I'll attach a patch shortly.
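
The two options in the description can be contrasted with a small sketch. Everything here (Kind, Resource, SketchConfig, clientSideFilter, parse) is a hypothetical name invented for illustration and is not Tika's API; the sketch only assumes that the parser can tell an inline image from a regular attachment.

```java
import java.util.ArrayList;
import java.util.List;

public class InlineImageOptions {
    enum Kind { INLINE_IMAGE, ATTACHMENT }

    // Hypothetical embedded resource, tagged with where it came from.
    static final class Resource {
        final String name;
        final Kind kind;

        Resource(String name, Kind kind) {
            this.name = name;
            this.kind = kind;
        }
    }

    // Hypothetical config object carrying the switch proposed in option 2.
    static final class SketchConfig {
        private boolean extractInlineImages;

        void setExtractInlineImages(boolean value) {
            this.extractInlineImages = value;
        }

        boolean getExtractInlineImages() {
            return extractInlineImages;
        }
    }

    // Option 1: the parser emits everything with a tag, and the client
    // drops the resources it does not want after the fact.
    static List<Resource> clientSideFilter(List<Resource> emitted) {
        List<Resource> kept = new ArrayList<>();
        for (Resource r : emitted) {
            if (r.kind != Kind.INLINE_IMAGE) {
                kept.add(r);
            }
        }
        return kept;
    }

    // Option 2: the parser consults the config and never emits inline
    // images when the switch is off. Attachments are always emitted.
    static List<Resource> parse(List<Resource> embedded, SketchConfig config) {
        List<Resource> emitted = new ArrayList<>();
        for (Resource r : embedded) {
            if (r.kind == Kind.INLINE_IMAGE && !config.getExtractInlineImages()) {
                continue;
            }
            emitted.add(r);
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<Resource> embedded = new ArrayList<>();
        embedded.add(new Resource("page1-image", Kind.INLINE_IMAGE));
        embedded.add(new Resource("report.xls", Kind.ATTACHMENT));

        SketchConfig config = new SketchConfig();
        config.setExtractInlineImages(false);
        System.out.println(parse(embedded, config).size());

        config.setExtractInlineImages(true);
        System.out.println(parse(embedded, config).size());
    }
}
```

Both routes leave the client with the same resources. The difference is that option 2 skips the images inside the parser, so the client never has to receive and discard them.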



--
This message was sent by Atlassian JIRA
(v6.2#6252)