Posted to users@pdfbox.apache.org by Nicklas Karlsson <ni...@gmail.com> on 2012/03/28 08:45:00 UTC

Fonts in pdf to image conversion

Hi,

  I'm using the latest LibreOffice to produce a PDF and the latest PDFBox
to extract the pages as images, but I'm having some problems with the fonts.
If I use Times New Roman, I get this message:

org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Changing font on <test> from <Times New Roman> to the default font

  If I embed some more exotic fonts in the PDF, I get these messages:

org.apache.pdfbox.util.PDFStreamEngine processOperator
unsupported/disabled operation: BMC
org.apache.pdfbox.util.PDFStreamEngine processOperator
unsupported/disabled operation: EMC
org.apache.pdfbox.util.PDFStreamEngine processOperator
unsupported/disabled operation: BDC
org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Changing font on <test> from <Algerian> to the default font

This is all on the same machine. Is there a special trick to getting the
fonts working?

The extraction is done with something like

PDDocument doc = PDDocument.load(pdf);
List pages = doc.getDocumentCatalog().getAllPages();
for (int i = 0; i < pages.size(); i++)
{
    PDPage page = (PDPage) pages.get(i);
    pics.add(page.convertToImage());
}


Thanks in advance,
  Nik

-- 
---
Nik

Re: Fonts in pdf to image conversion

Posted by Hamed Iravanchi <ir...@gmail.com>.
Hi,

As far as I remember, ICEpdf didn't render right-to-left languages
correctly.
I'm not sure though; maybe it is fixed now.

-Hamed

On Wed, Apr 4, 2012 at 11:48 AM, Nicklas Karlsson <ni...@gmail.com> wrote:

> Thanks for the information. I continued my search for libraries and
> stumbled on ICEpdf from ICEsoft, and it works there, so you could check for
> hints in their source code while improving PDFBox ;-)

Re: Fonts in pdf to image conversion

Posted by Nicklas Karlsson <ni...@gmail.com>.
Thanks for the information. I continued my search for libraries and
stumbled on ICEpdf from ICEsoft, and it works there, so you could check for
hints in their source code while improving PDFBox ;-)

On Wed, Apr 4, 2012 at 9:57 AM, Hamed Iravanchi <ir...@gmail.com> wrote:

> Hi Nicklas,
>
> I've been working on this issue for a while.
> Right now, PDFBox cannot correctly convert PDF files created by OpenOffice
> or LibreOffice to images.
> In my tests, PDF files created by Microsoft Word do not have this problem
> with the latest trunk code.
>
> This happens because PDFBox renders the image from the extracted text
> rather than from the code points.
> Andreas used to reply to my emails so we could collaborate and resolve such
> issues faster, but I haven't received any reply lately.
> I don't know if I'm posting in the right place or not, though...
>
> Anyway, to fix this issue for TrueType fonts (which are typically used in
> your case), PDFBox should do the following:
> - Use code points for all TrueType fonts, instead of extracted text.
> - Map the code points to glyph codes using the font's cmap table.
> - Use the glyph codes to draw text on the image.
>
> I just managed to fix this yesterday in my code for my sample PDF files, by
> modifying the trunk code.
> But I'm waiting for the developer team to collaborate so that I can make
> sure what I'm doing is right and doesn't break other parts of PDFBox.
>
> -Hamed



-- 
---
Nik

Re: Fonts in pdf to image conversion

Posted by Hamed Iravanchi <ir...@gmail.com>.
Hi Nicklas,

I've been working on this issue for a while.
Right now, PDFBox cannot correctly convert PDF files created by OpenOffice
or LibreOffice to images.
In my tests, PDF files created by Microsoft Word do not have this problem
with the latest trunk code.

This happens because PDFBox renders the image from the extracted text
rather than from the code points.
Andreas used to reply to my emails so we could collaborate and resolve such
issues faster, but I haven't received any reply lately.
I don't know if I'm posting in the right place or not, though...

Anyway, to fix this issue for TrueType fonts (which are typically used in
your case), PDFBox should do the following:
- Use code points for all TrueType fonts, instead of extracted text.
- Map the code points to glyph codes using the font's cmap table.
- Use the glyph codes to draw text on the image.
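
The three steps above can be sketched roughly as follows. This is only an
illustration in plain Java (java.awt), not the actual PDFBox change: the
class name GlyphMappingSketch, the helpers toGlyphCodes and draw, and the
cmap contents are all hypothetical, and a real cmap would be read from the
embedded font rather than built by hand.

```java
import java.awt.Color;
import java.awt.Font;
import java.awt.Graphics2D;
import java.awt.font.GlyphVector;
import java.awt.image.BufferedImage;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: not PDFBox code, names are made up for illustration.
final class GlyphMappingSketch {

    // Glyph index 0 is the "missing glyph" (.notdef) in TrueType fonts.
    static final int NOTDEF = 0;

    // Steps 1 and 2: take the character codes as they appear in the page
    // content and map each one to a glyph code through the font's cmap.
    // Codes without a cmap entry fall back to the missing glyph.
    static int[] toGlyphCodes(int[] charCodes, Map<Integer, Integer> cmap) {
        int[] glyphs = new int[charCodes.length];
        for (int i = 0; i < charCodes.length; i++) {
            glyphs[i] = cmap.getOrDefault(charCodes[i], NOTDEF);
        }
        return glyphs;
    }

    // Step 3: draw by glyph code, so no extracted text is involved.
    // The glyph codes must belong to the given font.
    static BufferedImage draw(Font font, int[] glyphCodes, int width, int height) {
        BufferedImage image =
                new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = image.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, width, height);
            g.setColor(Color.BLACK);
            GlyphVector gv =
                    font.createGlyphVector(g.getFontRenderContext(), glyphCodes);
            g.drawGlyphVector(gv, 10f, height / 2f);
        } finally {
            g.dispose();
        }
        return image;
    }

    public static void main(String[] args) {
        // Made-up cmap entries, for demonstration only.
        Map<Integer, Integer> cmap = new HashMap<>();
        cmap.put(0x74, 87); // 't'
        cmap.put(0x65, 72); // 'e'
        cmap.put(0x73, 86); // 's'
        int[] codes = {0x74, 0x65, 0x73, 0x74};
        System.out.println(Arrays.toString(toGlyphCodes(codes, cmap)));
    }
}
```

Falling back to glyph 0 keeps a missing cmap entry visible in the output
instead of silently switching to another font.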

I just managed to fix this yesterday in my code for my sample PDF files, by
modifying the trunk code.
But I'm waiting for the developer team to collaborate so that I can make
sure what I'm doing is right and doesn't break other parts of PDFBox.

-Hamed

