Posted to users@pdfbox.apache.org by Vaishali Mahajan <va...@gmail.com> on 2023/03/20 05:40:05 UTC

PDF to Word Conversion

Hi,

I am creating a PDF to Word conversion application using the PDFBox .NET
version. I can extract all the text from a PDF, but without formatting. I
want to preserve the formatting of the text, as well as all images, when
converting PDF to Word files. Please guide me.


Thanks

Re: PDF to Word Conversion

Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,

There is no official PDFBox .NET version. There are only unofficial
projects based on old PDFBox versions.

PDF has no "formatting" in the way HTML does. Glyphs are placed at
specified positions, sometimes one character at a time.
There are products that try to recreate paragraphs from this. PDFBox
tries this too, but the result is not perfect; see PDFText2HTML.java.
To get the images, see ExtractImages.java and PrintImageLocations.java.
One would have to combine all of this, and the result would still not
look very close to the original PDF.
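
To illustrate why recreating text from positioned glyphs is only a
heuristic, here is a minimal sketch in plain Java. It is not PDFBox code
and not the algorithm PDFBox uses. The class GlyphLineBuilder, the Glyph
record and the two thresholds (lineTolerance, spaceGap) are hypothetical
names made up for this example. It assumes each glyph comes with a left x
position, a baseline y position and a width, with y growing upwards.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: rebuilds text lines from individually positioned glyphs. */
final class GlyphLineBuilder {

    /** One glyph as placed on the page: text, left x, baseline y, width (assumed fields). */
    record Glyph(String text, float x, float y, float width) {}

    /**
     * Groups glyphs into lines by baseline, then inserts a space wherever the
     * horizontal gap between two neighbouring glyphs is larger than spaceGap.
     */
    static List<String> toLines(List<Glyph> glyphs, float lineTolerance, float spaceGap) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        // Top of the page first: larger y values come earlier.
        sorted.sort((a, b) -> Float.compare(b.y(), a.y()));

        List<List<Glyph>> rows = new ArrayList<>();
        for (Glyph g : sorted) {
            List<Glyph> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            // Same line if the baseline is close to that of the row's first glyph.
            if (last != null && Math.abs(last.get(0).y() - g.y()) <= lineTolerance) {
                last.add(g);
            } else {
                List<Glyph> row = new ArrayList<>();
                row.add(g);
                rows.add(row);
            }
        }

        List<String> lines = new ArrayList<>();
        for (List<Glyph> row : rows) {
            // Within a line, read left to right.
            row.sort((a, b) -> Float.compare(a.x(), b.x()));
            StringBuilder sb = new StringBuilder();
            Glyph prev = null;
            for (Glyph g : row) {
                // A wide gap after the previous glyph is treated as a word break.
                if (prev != null && g.x() - (prev.x() + prev.width()) > spaceGap) {
                    sb.append(' ');
                }
                sb.append(g.text());
                prev = g;
            }
            lines.add(sb.toString());
        }
        return lines;
    }
}
```

Calling GlyphLineBuilder.toLines(glyphs, 2f, 1.5f) returns one string per
reconstructed line. Both thresholds are guesses that depend on font size,
and nothing here handles columns, rotated text or tables, which is part
of why the output of such tools never quite matches the PDF.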

Tilman


