Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/08/17 17:40:01 UTC

[jira] [Commented] (TIKA-2374) Tika App -z should extract PDF inline images by default

    [ https://issues.apache.org/jira/browse/TIKA-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130891#comment-16130891 ] 

Tim Allison commented on TIKA-2374:
-----------------------------------

For posterity, to process inline images (e.g. for OCR'ing PDFs), pass in a custom config like this:

{noformat}
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
        </parser>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="extractInlineImages" type="bool">true</param>
            </params>
        </parser>
    </parsers>
</properties>
{noformat}
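
A minimal sketch of how the config above might be used with the Tika App, assuming it has been saved as tika-config.xml. The jar name, the config file name, the output directory and input.pdf are all placeholders, not taken from this issue. The flags --config=, -z and --extract-dir= are the ones listed in the app's usage text, but check --help for your version.

```shell
# Assumed names: adjust the jar version and paths to your setup.
TIKA_JAR="tika-app-1.16.jar"      # hypothetical jar location
TIKA_CONFIG="tika-config.xml"     # the config above, saved to a file
OUT_DIR="extracted"               # target directory for -z output

if [ -f "$TIKA_JAR" ] && [ -f "$TIKA_CONFIG" ]; then
    # -z extracts embedded resources; with the custom config this
    # should include PDF inline images as well.
    java -jar "$TIKA_JAR" --config="$TIKA_CONFIG" -z --extract-dir="$OUT_DIR" input.pdf
else
    echo "Missing $TIKA_JAR or $TIKA_CONFIG; nothing extracted." >&2
fi
```

Because an explicit config is passed here, the app uses it as given, so the result does not depend on whatever -z does with the default config.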

> Tika App -z should extract PDF inline images by default
> -------------------------------------------------------
>
>                 Key: TIKA-2374
>                 URL: https://issues.apache.org/jira/browse/TIKA-2374
>             Project: Tika
>          Issue Type: Improvement
>          Components: cli
>    Affects Versions: 1.14
>            Reporter: Nick Burch
>             Fix For: 1.16
>
>
> As discussed on dev@: if you use the Tika App with the default config and the {{-z}} extract option, it will extract embedded resources, except PDF inline images. This is unexpected for new users, who won't know that they need to pass in a custom config with the {{extractInlineImages}} PDF parser option set.
> If the user passes in an explicit config to the app, we should respect that. However, if they don't pass one in and take the default, the {{-z}} option (and only that option) should enable whatever options are needed to make extraction work properly and fully (currently just {{extractInlineImages}}).
> If possible/easy, the {{-z}} option should print out some info to let affected users know that the default config was tweaked to give extra embedded resources.


