Posted to users@pdfbox.apache.org by Lukas Baab <19...@web.de> on 2013/03/07 10:23:10 UTC

use embedded fonts to write text

Hi!

I want to read the text of a PDF and simply write it again onto the same page.

In theory this is simple: use the PDFStreamEngine to get all TextPositions of a page. A TextPosition has everything needed to write the text again with the same font at the same place. The code is below; the complete example is in the attachment.

Unfortunately it is not that easy: whether this approach works depends on the font of the text and on how the text is embedded in the PDF.

Questions:
Which font types and kinds of font embedding does PDFBox support? (Which of them can be reused for writing within the same PDF?)
Do I have to handle different embedded fonts differently? If so, how?
How can I check whether I can write a given text with a given font?

I appreciate every kind of advice and answer!

Thanks
Lukas



Attachment:
Code of TextReprintExample
exampleFiles:
  example 1: created with LibreOffice, the whole text is reprinted with wrong characters
  example 2: created with Word, the text is reprinted correctly, but special characters ( „ and “ ) are not reprinted



Here is the code:

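import java.awt.Color;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.edit.PDPageContentStream;
import org.apache.pdfbox.util.TextPosition;

// getTextPosition(page) is not shown here; presumably it is part of the attached TextReprintExample.
public void reprintTextTest() throws Exception {
  PDDocument document = PDDocument.load("E:/80_tmp/test.pdf");
  try {
    List<PDPage> allPages = document.getDocumentCatalog().getAllPages();

    for (PDPage page : allPages) {
      List<TextPosition> textPositionsOfPage = getTextPosition(page);
      writeText(document, page, textPositionsOfPage);
    }

    document.save("E:/80_tmp/test-result.pdf");
  } finally {
    // Release the document even if reading or writing fails.
    document.close();
  }
}

private void writeText(PDDocument document, PDPage page, List<TextPosition> textPositions) throws IOException {
  float pageHeight = page.findMediaBox().getHeight();
  // appendContent = true, compress = true: add to the existing page content.
  PDPageContentStream pageContentStream = new PDPageContentStream(document, page, true, true);
  try {
    pageContentStream.setNonStrokingColor(Color.GREEN);

    for (TextPosition textPosition : textPositions) {
      float x = textPosition.getX();
      // TextPosition.getY() is measured from the top of the page; PDF user space starts at the bottom.
      float y = pageHeight - textPosition.getY();
      pageContentStream.beginText();
      pageContentStream.moveTextPositionByAmount(x, y);
      // Reuse the font object that the original text was shown with.
      pageContentStream.setFont(textPosition.getFont(), textPosition.getFontSize());
      pageContentStream.drawString(textPosition.getCharacter());
      pageContentStream.endText();
    }
  } finally {
    pageContentStream.close();
  }
}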

Re: use embedded fonts to write text

Posted by Lukas Baab <19...@web.de>.
Hi Maruan,

I checked that the PDF really contains characters and not just vectors. So I should be able to reuse these characters to print the same text again in the PDF, at least if PDFBox supports this.

Regarding the encoding of the text:
Whether PDFBox is able to write some text with a specific font depends on the encoding of the font. (PDFBox only supports WinAnsiEncoding.) Does it depend only on the encoding of the font, or on something else as well?

One last question that is important for me:
How can I check programmatically whether PDFBox is able to reprint a character? PDFont.getEncoding is protected.

Thanks for all your help!
Lukas






Re: use embedded fonts to write text

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi Lukas,

If what you see in a PDF is really a character (it could also be vectors) and the information is available in the PDF, then yes, it should be possible to reuse and reprint it. A quick way to check whether the text you see on screen is represented as characters (with correct font information, mapping …) is to copy and paste the text from within Adobe Reader. Also please be aware that, as far as I know, PDFBox has limitations in supporting different text encodings when writing text. As I have never used PDFBox for such applications, I might be wrong though. I think you are safe if the characters you are trying to handle are within WinAnsiEncoding.

Maruan Sahyoun
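
The WinAnsiEncoding suggestion above can be turned into a rough pre-check. The following is only a sketch under one loud assumption: Java's "windows-1252" charset is used as a stand-in for PDF WinAnsiEncoding, which is closely related but not guaranteed to match in every code point. The class and method names are made up for this illustration and are not part of PDFBox.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

/** Rough pre-check: does every character of a string have a Windows-1252 code? */
final class WinAnsiCheck {

    // Assumption: "windows-1252" approximates PDF WinAnsiEncoding.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Returns true if the whole text can be encoded in windows-1252. */
    static boolean fitsWinAnsi(String text) {
        // Encoders are not thread-safe, so a fresh one is created per call.
        CharsetEncoder encoder = CP1252.newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(fitsWinAnsi("Hello"));          // plain ASCII
        System.out.println(fitsWinAnsi("\u201EHi\u201C")); // German quotes „ and “
        System.out.println(fitsWinAnsi("\u03A9"));         // Greek capital omega
    }
}
```

Note that the quotes „ and “ do have windows-1252 codes, so this check accepts them. Passing it therefore says nothing about whether the font inside a particular PDF contains or maps the glyph; it only filters out characters that are outside the encoding altogether.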



Re: use embedded fonts to write text

Posted by Lukas Baab <19...@web.de> on 2013/03/08 10:37.
Hi!

Thanks Maruan for your answer.

One more question:
No matter what fonts, kinds of font and kinds of encoding are used in the PDF, in theory it should be possible to reprint a glyph that is already in the PDF, shouldn't it?
Because the glyph is certainly in the font (it is printed in the original PDF, so it must be in the font somewhere), it "must" be possible to print this glyph again!?
In my example the step from glyph to character also works: I can print the character of the glyph on the console, so I know this works. Therefore (I think and guess) it should be possible to get the glyph for the character again. (At least it should not be a problem of the CMap or something like that!?)

That's the theory. Is this correct?
And just one more little question: How can I do this?  :)

Thanks to all for your help and the great work you all have already done with PDFBox!
Lukas
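
The idea of going back from character to glyph can be sketched with a toy model. This is not PDFBox code: SubsetFontModel, its constructor argument and the sample table are all hypothetical. It assumes a subset font whose character codes are arbitrary small numbers and a code-to-Unicode table of the kind a ToUnicode CMap provides. Reprinting then means inverting that table, and it fails for any character that is not in the subset.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Toy model of a subset font; not a PDFBox class. */
final class SubsetFontModel {

    // Unicode text -> character code, built by inverting a code -> Unicode table.
    private final Map<String, Integer> unicodeToCode = new HashMap<>();

    /** @param codeToUnicode hypothetical table, as a ToUnicode CMap might provide it */
    SubsetFontModel(Map<Integer, String> codeToUnicode) {
        // Visit codes in ascending order so the lowest code wins when two
        // codes map to the same text.
        for (Map.Entry<Integer, String> entry
                : new TreeMap<Integer, String>(codeToUnicode).entrySet()) {
            unicodeToCode.putIfAbsent(entry.getValue(), entry.getKey());
        }
    }

    /** Returns the codes that show the text, or empty if a character is missing from the subset. */
    Optional<int[]> encode(String text) {
        int[] codes = new int[text.codePointCount(0, text.length())];
        int next = 0;
        for (int offset = 0; offset < text.length(); ) {
            int codePoint = text.codePointAt(offset);
            Integer code = unicodeToCode.get(new String(Character.toChars(codePoint)));
            if (code == null) {
                return Optional.empty(); // no glyph for this character in the subset
            }
            codes[next++] = code;
            offset += Character.charCount(codePoint);
        }
        return Optional.of(codes);
    }
}
```

With the table 1=H, 2=e, 3=l, 4=o, encode("Hello") yields the codes 1, 2, 3, 3, 4, while encode("Help") is empty because "p" is not in the subset. Writing "H" as its ASCII code 72 would select no glyph at all in this toy font, which is one plausible reason why a naive reprint can produce wrong characters. Real fonts add further complications, for example one code that maps to several characters (ligatures).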






Re: use embedded fonts to write text

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi Lukas,

There are different font formats specified in the PDF specification. They are supported within PDFBox through the PDFont class [1] and its subclasses and the FontBox library. Not all of these fonts have 'real' characters; some might just be 'curves'. Fonts can also be embedded or merely referenced by the PDF. Let's assume the text you are trying to reprint is based on an embedded TrueType font. Then there is something called 'subsetting': not all characters of a font are embedded into the PDF, only the characters needed to represent the current text. Then there is encoding ….

So the code you are presenting only works in certain cases, as you already found out. A complete description of PDF font handling can be found in section 9.2 of the ISO 32000 (PDF) spec.

I would need to review the samples you attached to give you some more hints. Unfortunately I won't have time to do so before the start of next week. Maybe other people will provide some additional information for you.

Maruan Sahyoun

[1] http://pdfbox.apache.org/apidocs/org/apache/pdfbox/pdmodel/font/PDFont.html
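
One practical consequence of subsetting is that a subset font can usually be recognized by its name: under the font-subset rules of ISO 32000-1, the font name carries a tag of six uppercase letters followed by a plus sign. The helper below is a minimal sketch of that check. The class name, method names and sample font names are invented for this illustration, and how the name is obtained from PDFBox is deliberately left out.

```java
import java.util.regex.Pattern;

/** Detects the subset tag (six uppercase letters plus '+') in a PDF font name. */
final class SubsetFontNames {

    // Example of a tagged name (hypothetical): "ABCDEF+Arial".
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+.+");

    /** Returns true if the name starts with a subset tag. */
    static boolean looksLikeSubset(String fontName) {
        return fontName != null && SUBSET_TAG.matcher(fontName).matches();
    }

    /** Returns the name without the subset tag, or the name unchanged if it has none. */
    static String stripSubsetTag(String fontName) {
        // The tag is six letters plus '+', so the real name starts at index 7.
        return looksLikeSubset(fontName) ? fontName.substring(7) : fontName;
    }
}
```

A font that looks like a subset is a warning sign for reprinting: only the glyphs used by the original text are guaranteed to be present, so any character outside that set cannot be drawn with it.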

