Posted to users@pdfbox.apache.org by Serban Alexe <se...@gmail.com> on 2018/02/01 16:14:00 UTC

Convert PDF to HTML with PDFBox in a Java app - Need some introductory info & guidance

Hello everybody,

I need to write a Java class that converts a PDF document to HTML,
preserving the original formatting as far as possible.
I also need to extract the images (and preferably embed them in the
HTML file as base64).

*Can you please point me to some useful starting points and/or examples?*

Through a Google search I found only a few examples with limited
functionality. None of them deal with images, and my guess is that they
target an older version of PDFBox...

Thank you,

Serban

Re: Convert PDF to HTML with PDFBox in a Java app - Need some introductory info & guidance

Posted by Jason Harrop <jh...@gmail.com>.

https://github.com/FitLayout/PDFAnalyzer looks promising.

Re: Convert PDF to HTML with PDFBox in a Java app - Need some introductory info & guidance

Posted by Tilman Hausherr <TH...@t-online.de>.

Hi,

Please have a look at the PDFText2HTML class in the source code
download. There are also ExtractImages and PrintImageLocations
classes, but each of them works on its own; nothing combines them. You
will never get output that looks just like the PDF, because PDF and HTML
are really two different things.

Tilman
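
For the base64 part of the question, here is a minimal sketch that uses only
the JDK, not PDFBox. It assumes the image is already available as a
java.awt.image.BufferedImage (in PDFBox 2.x, PDImageXObject.getImage()
returns one). The class name InlineImageHtml and the method toImgTag are
made up for illustration, and re-encoding every image as PNG is a
simplification rather than something the thread prescribes.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of PDFBox.
final class InlineImageHtml {

    private InlineImageHtml() {
    }

    /**
     * Re-encodes the image as PNG and returns an <img> tag whose src is a
     * base64 data URI, so the HTML file needs no external image files.
     */
    public static String toImgTag(BufferedImage image) {
        ByteArrayOutputStream png = new ByteArrayOutputStream();
        try {
            // ImageIO.write returns false when no suitable writer is found.
            if (!ImageIO.write(image, "png", png)) {
                throw new IllegalStateException("No PNG writer for this image type");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String base64 = Base64.getEncoder().encodeToString(png.toByteArray());
        return "<img width=\"" + image.getWidth()
                + "\" height=\"" + image.getHeight()
                + "\" src=\"data:image/png;base64," + base64 + "\"/>";
    }
}
```

Calling InlineImageHtml.toImgTag(image) for each extracted image and
appending the result to the generated HTML keeps everything in one file. The
price is size: base64 output is roughly a third larger than the raw bytes.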
