Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2017/09/09 12:53:00 UTC

[jira] [Commented] (PDFBOX-3926) ExtractImages

    [ https://issues.apache.org/jira/browse/PDFBOX-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16159927#comment-16159927 ] 

Tilman Hausherr commented on PDFBOX-3926:
-----------------------------------------

This would be very difficult: we would have to reproduce the PDF's layout in HTML, and the result would almost never be perfect. Images can appear as ordinary images, as backgrounds, or as masks.
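
To illustrate only the easy part of the problem, here is a minimal sketch, assuming the page number and vertical position of every text block and image were already known (the two commands in the issue do not provide this, as far as I know). The `Item` class, its `top` field, and `toHtml` are hypothetical names for illustration, not PDFBox API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names are PDFBox API.
class PageMergeSketch {

    /** A text block or an image file, with an assumed known position. */
    static final class Item {
        final int page;        // 1-based page number
        final float top;       // assumed distance from the top of the page
        final boolean image;   // true: content is an image file name
        final String content;  // the text, or the path of the extracted image

        Item(int page, float top, boolean image, String content) {
            this.page = page;
            this.top = top;
            this.image = image;
            this.content = content;
        }
    }

    /** Interleaves text and images by page, then by vertical position. */
    static String toHtml(List<Item> texts, List<Item> images) {
        List<Item> all = new ArrayList<>(texts);
        all.addAll(images);
        // List.sort is stable, so text stays ahead of an image at the same position.
        all.sort((a, b) -> a.page != b.page
                ? Integer.compare(a.page, b.page)
                : Float.compare(a.top, b.top));
        StringBuilder html = new StringBuilder();
        for (Item item : all) {
            if (item.image) {
                html.append("<img src=\"").append(item.content).append("\">\n");
            } else {
                html.append("<p>").append(escape(item.content)).append("</p>\n");
            }
        }
        return html.toString();
    }

    // Minimal HTML escaping for text content.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        List<Item> texts = new ArrayList<>();
        texts.add(new Item(1, 50f, false, "Introduction"));
        texts.add(new Item(1, 300f, false, "Text after the figure"));
        List<Item> images = new ArrayList<>();
        images.add(new Item(1, 120f, true, "output-1.png"));
        System.out.print(toHtml(texts, images));
    }
}
```

Even with positions available, this only orders elements from top to bottom. It ignores columns, backgrounds, and masks, which is why the result would rarely match the PDF.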

I think Apache Tika has something in that direction, but it is likely not perfect either.

I intend to close this as "won't fix".

> ExtractImages 
> --------------
>
>                 Key: PDFBOX-3926
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3926
>             Project: PDFBox
>          Issue Type: Improvement
>            Reporter: Hasan Karaoğlu
>
> Hi, I extract text from a PDF with the command below, but it doesn't extract images, so I also use the ExtractImages command. How can we merge the two outputs in sequence?
> Extract Texts: (First command)
> {code:java}
> java -jar pdfbox.jar ExtractText -html {{inputFileName}} -startPage {{startPage}} -endPage {{endPage}} -encoding UTF-8  {{outputFileName}}
> {code}
> Extract Images: (Second command)
> {code:java}
> java -jar pdfbox-app.jar ExtractImages [OPTIONS] <inputfile>
> {code}
> For example, I run the first command and get an output.html file, but this file contains only the text parts of the page; there are no images. Then I run the second command and get the images as files. How can I merge these two separate outputs? The order of elements on the page is important.
>  


