Posted to users@pdfbox.apache.org by Serban Alexe <se...@gmail.com> on 2018/02/01 16:14:00 UTC
Convert PDF to HTML with PDFBox in a Java app - Need some
introductory info & guidance
Hello everybody,
I need to write a Java class that converts a *.pdf* document to HTML,
preserving the original formatting as far as possible.
I also need to be able to extract the images (and preferably embed them
as base64 in the HTML file).
*Can you please provide me with some useful starting points and/or examples?*
Through a Google search, I was able to find some examples with limited
functionality. None of them deal with images, and my guess is that they
target an older version of the PDFBox suite...
Thank you,
Serban
Re: Convert PDF to HTML with PDFBox in a Java app - Need some
introductory info & guidance
Posted by Jason Harrop <jh...@gmail.com>.
https://github.com/FitLayout/PDFAnalyzer is promising
Re: Convert PDF to HTML with PDFBox in a Java app - Need some
introductory info & guidance
Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,
Please have a look at the PDFText2HTML class in the source code
download. There are also ExtractImages and PrintImageLocations
classes, but each of them stands alone. You will never get output that
looks just like the PDF, because PDF and HTML are really two different
things.
Tilman
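The base64 part of the question can be handled with the JDK alone, once
the image bytes are available (for example, from files written by an
image extraction step). The following is only a minimal sketch under
that assumption: HtmlImageEmbedder, toDataUri, toImgTag and
escapeAttribute are hypothetical names invented for illustration and are
not part of PDFBox, and the MIME type has to be supplied by the caller.

```java
import java.util.Base64;

/**
 * Hypothetical helper (not part of PDFBox): inlines raw image bytes
 * into HTML as base64 data URIs.
 */
final class HtmlImageEmbedder {

    private HtmlImageEmbedder() {
        // static helpers only
    }

    /** Builds a data URI of the form "data:<mimeType>;base64,<encoded bytes>". */
    static String toDataUri(byte[] imageBytes, String mimeType) {
        String encoded = Base64.getEncoder().encodeToString(imageBytes);
        return "data:" + mimeType + ";base64," + encoded;
    }

    /** Wraps the data URI in an img element with escaped alt text. */
    static String toImgTag(byte[] imageBytes, String mimeType, String altText) {
        return "<img src=\"" + toDataUri(imageBytes, mimeType)
                + "\" alt=\"" + escapeAttribute(altText) + "\"/>";
    }

    /** Escapes the characters that are unsafe inside a double-quoted HTML attribute. */
    static String escapeAttribute(String value) {
        return value.replace("&", "&amp;")
                .replace("\"", "&quot;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // Placeholder bytes; real code would pass the bytes of an extracted image.
        byte[] placeholder = {1, 2, 3};
        System.out.println(toImgTag(placeholder, "image/png", "demo"));
    }
}
```

Run as is, main prints <img src="data:image/png;base64,AQID" alt="demo"/>.
The resulting tag could be written into the generated HTML wherever the
image belongs; deciding that position is a separate problem that this
sketch does not address.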