Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2014/08/14 19:52:12 UTC

[jira] [Commented] (TIKA-1396) Embedded images in PDF documents

    [ https://issues.apache.org/jira/browse/TIKA-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14097278#comment-14097278 ] 

Tim Allison commented on TIKA-1396:
-----------------------------------

In 1.5, Tika only extracts "attachments" from PDFs. If you are trying to extract embedded images, you'll need the soon-to-be-released 1.6 or trunk (see TIKA-1268 and TIKA-1294).

If Tika is failing to extract an attachment from a PDF in 1.5, that is a bug; please post your document if you can share it.

By default, for memory reasons, we've turned off inline image extraction; you'll need to turn it on via the config file for the PDFParser (see TIKA-1294).
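
As a rough sketch of what that config change might look like: the fragment below assumes the setting is a key named extractInlineImages in the PDFParser's properties file (org/apache/tika/parser/pdf/PDFParser.properties on the classpath). Both the key name and the file location are assumptions based on TIKA-1294 and are not stated in this thread, so please check them against the release you are actually running.

```properties
# Assumed override of org/apache/tika/parser/pdf/PDFParser.properties.
# Key name taken from TIKA-1294; verify it against your Tika version.
# The shipped default is expected to be "false" for memory reasons.
extractInlineImages true
```

With inline image extraction enabled, expect higher memory use on image-heavy PDFs, which is why it is off by default.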



> Embedded images in PDF documents
> --------------------------------
>
>                 Key: TIKA-1396
>                 URL: https://issues.apache.org/jira/browse/TIKA-1396
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.5
>         Environment: *OS:* 
> Ubuntu 14.04.1 LTS
> *KERNEL:*
> 3.13.0-33-generic 
> gcc version 4.8.2
> *JAVA:*
> java version "1.8.0_11"
> Java(TM) SE Runtime Environment (build 1.8.0_11-b12)
> Java HotSpot(TM) 64-Bit Server VM (build 25.11-b03, mixed mode)
>            Reporter: Damiano
>            Priority: Critical
>
> Hello!
> I just found a problem with PDF documents that have embedded images.
> Doing:
> java -jar tika-app-1.5.jar --extract tika.pdf
> Tika cannot find the image.
> Is this a PDF-related problem? If I do the same operation with a DOC document, Tika finds the image correctly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)