Posted to users@pdfbox.apache.org by Tamas Kocsis <mr...@gmail.com> on 2021/02/12 09:46:46 UTC

Inquiry for an open PDFBox TODO in PDCIDFontType2.encode method

Hi Everyone!

We faced the issue described in this SO question:

https://stackoverflow.com/questions/61934819/pdfbox-no-glyph-for-u0050-in-extracted-font


The TODO in question is in PDCIDFontType2's encode method:

*//TODO: invert the ToUnicode CMap?*


I just wanted to ask whether you have any info on the implementation of
this one?
Is it on the roadmap, planned in the near/far future or still open...?

Best Regards: Tamas Kocsis

Re: Inquiry for an open PDFBox TODO in PDCIDFontType2.encode method

Posted by Tilman Hausherr <TH...@t-online.de>.

Thanks for the feedback! It has been fixed here:
https://issues.apache.org/jira/browse/PDFBOX-5103

There is also a snapshot build here
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.23-SNAPSHOT/


Tilman


On 17.02.2021 at 14:34, Tamas Kocsis wrote:
> It works!
> TestFontEmbedding succeeded and I also executed my own test successfully.
> There were some bumps, but no roadblocks :)
>
> Thank you for your help Tilman - I really appreciate it!


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: Inquiry for an open PDFBox TODO in PDCIDFontType2.encode method

Posted by Tamas Kocsis <mr...@gmail.com>.

It works!
TestFontEmbedding succeeded and I also executed my own test successfully.
There were some bumps, but no roadblocks :)

Thank you for your help Tilman - I really appreciate it!

On Tue, Feb 16, 2021 at 6:14 AM Tamas Kocsis <mr...@gmail.com>
wrote:

> Thank you!
> I'll give it a try and let you know.

Re: Inquiry for an open PDFBox TODO in PDCIDFontType2.encode method

Posted by Tamas Kocsis <mr...@gmail.com>.

Thank you!
I'll give it a try and let you know.

On Mon, Feb 15, 2021 at 6:06 PM Tilman Hausherr <TH...@t-online.de>
wrote:

> On 15.02.2021 at 10:32, Tamas Kocsis wrote:
> > Thanks for the info and for looking into it.
> > Never tried building PDFBox from source, but I guess I could do it. Would
> > be nice if I could test this with 2.0...
>
> OK here's some code. If you can't get it to run (don't waste too much time
> if you hit roadblocks) then I'll create an issue and commit and build a
> snapshot.

Re: Inquiry for an open PDFBox TODO in PDCIDFontType2.encode method

Posted by Tilman Hausherr <TH...@t-online.de>.

On 15.02.2021 at 10:32, Tamas Kocsis wrote:
> Thanks for the info and for looking into it.
> Never tried building PDFBox from source, but I guess I could do it. Would
> be nice if I could test this with 2.0...

OK here's some code. If you can't get it to run (don't waste too much time
if you hit roadblocks) then I'll create an issue and commit and build a
snapshot.

PDFont.java:



     /**
      * Get the /ToUnicode CMap.
      *
      * @return The /ToUnicode CMap or null if there is none.
      */
     protected CMap getToUnicodeCMap()
     {
         return toUnicodeCMap;
     }

PDCIDFontType2.java:

add this at the place mentioned in your first post

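                 // null guard added: getToUnicodeCMap() returns null if there is no /ToUnicode CMap
                 CMap toUnicodeCMap = parent.getToUnicodeCMap();
                 if (toUnicodeCMap != null)
                 {
                     byte[] codes = toUnicodeCMap.getCodesFromUnicode(Character.toString((char) unicode));
                     if (codes != null)
                     {
                         return codes;
                     }
                 }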
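
A minimal, self-contained sketch of the lookup order this snippet aims for: try the font's own cmap first, then fall back to the inverted /ToUnicode data, and only then report a missing glyph. The class EncodeFallbackSketch, its methods and the two-byte code layout are assumptions made for illustration; none of them are PDFBox API. Unlike the snippet above, it builds the lookup key with Character.toChars, which also covers code points above U+FFFF.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a subsetted font; not PDFBox code. */
class EncodeFallbackSketch
{
    // code point -> glyph code from the font's own cmap (may be incomplete in a subset)
    private final Map<Integer, Integer> cmapLookup = new HashMap<Integer, Integer>();
    // inverted /ToUnicode data: Unicode string -> character code bytes
    private final Map<String, byte[]> unicodeToByteCodes = new HashMap<String, byte[]>();

    void addCmapEntry(int codePoint, int glyphCode)
    {
        cmapLookup.put(codePoint, glyphCode);
    }

    void addToUnicodeEntry(byte[] codes, String unicode)
    {
        unicodeToByteCodes.put(unicode, codes.clone());
    }

    byte[] encode(int codePoint)
    {
        Integer code = cmapLookup.get(codePoint);
        if (code != null)
        {
            // assumed layout: two-byte, big-endian character code
            return new byte[] { (byte) ((code >> 8) & 0xff), (byte) (code & 0xff) };
        }
        // fallback: look the character up in the inverted /ToUnicode data
        byte[] codes = unicodeToByteCodes.get(new String(Character.toChars(codePoint)));
        if (codes != null)
        {
            return codes.clone();
        }
        throw new IllegalArgumentException(String.format("No glyph for U+%04X", codePoint));
    }

    public static void main(String[] args)
    {
        EncodeFallbackSketch font = new EncodeFallbackSketch();
        font.addToUnicodeEntry(new byte[] { 0x00, 0x33 }, "P");
        // 'P' is not in the cmap, so the inverted /ToUnicode entry is used
        System.out.println(Arrays.toString(font.encode('P'))); // [0, 51]
    }
}
```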


in CMap.java


add

unicodeToByteCodes.put(unicode, codes.clone()); // clone needed, the bytes are modified later

as the first line of the method addCharMapping()
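
Why the clone matters can be shown in isolation. The sketch below stores a byte array in a map with and without cloning, then changes the caller's array. The class name and the increment are assumptions that stand in for whatever later modifies the bytes; it is not PDFBox code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical demo of array aliasing in a map; not PDFBox code. */
class CloneBeforeStoreSketch
{
    /** Stores codes under "A", mutates the caller's array, returns what the map now holds. */
    static byte[] storeThenMutate(boolean cloneFirst)
    {
        Map<String, byte[]> unicodeToByteCodes = new HashMap<String, byte[]>();
        byte[] codes = { 0x00, 0x41 };
        // without clone() the map keeps a reference to the caller's array
        unicodeToByteCodes.put("A", cloneFirst ? codes.clone() : codes);
        // the caller reuses its buffer afterwards (assumed here: a simple increment)
        codes[1]++;
        return unicodeToByteCodes.get("A");
    }

    public static void main(String[] args)
    {
        System.out.println(Arrays.toString(storeThenMutate(false))); // [0, 66]
        System.out.println(Arrays.toString(storeThenMutate(true)));  // [0, 65]
    }
}
```

Without the clone, the stored entry silently follows the caller's buffer and the inverted map would return the wrong code.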


also add these in the class

     // inverted map
     Map<String, byte[]> unicodeToByteCodes = new HashMap<String, byte[]>();


     /**
      * Get the code bytes for a Unicode string.
      *
      * @param unicode the Unicode string to look up
      * @return the code bytes or null if there is none.
      */
     public byte[] getCodesFromUnicode(String unicode)
     {
         return unicodeToByteCodes.get(unicode);
     }


and a test for TestFontEmbedding.java. If the test passes then you're
successful



     /**
      * Test that an embedded and subsetted font can be reused.
      *
      * @throws IOException
      */
     public void testReuseEmbeddedSubsettedFont() throws IOException
     {
         String text1 = "The quick brown fox";
         String text2 = "xof nworb kciuq ehT";
         ByteArrayOutputStream baos = new ByteArrayOutputStream();
         PDDocument document = new PDDocument();
         PDPage page = new PDPage();
         document.addPage(page);
         InputStream input = PDFont.class.getResourceAsStream(
                 "/org/apache/pdfbox/resources/ttf/LiberationSans-Regular.ttf");
         PDType0Font font = PDType0Font.load(document, input);
         PDPageContentStream stream = new PDPageContentStream(document, page);
         stream.beginText();
         stream.setFont(font, 20);
         stream.newLineAtOffset(50, 600);
         stream.showText(text1);
         stream.endText();
         stream.close();
         document.save(baos);
         document.close();
         // Append, while reusing the font subset
         document = PDDocument.load(baos.toByteArray());
         page = document.getPage(0);
         font = (PDType0Font) page.getResources().getFont(COSName.getPDFName("F1"));
         stream = new PDPageContentStream(document, page,
                 PDPageContentStream.AppendMode.APPEND, true);
         stream.beginText();
         stream.setFont(font, 20);
         stream.newLineAtOffset(250, 600);
         stream.showText(text2);
         stream.endText();
         stream.close();
         baos.reset();
         document.save(baos);
         document.close();
         // Test that both texts are there
         document = PDDocument.load(baos.toByteArray());
         PDFTextStripper stripper = new PDFTextStripper();
         String extractedText = stripper.getText(document);
         assertEquals(text1 + " " + text2, extractedText.trim());
         document.close();
     }



Re: Inquiry for an open PDFBox TODO in PDCIDFontType2.encode method

Posted by Tamas Kocsis <mr...@gmail.com>.

Hi Tilman!

Thanks for the info and for looking into it.
Never tried building PDFBox from source, but I guess I could do it. Would
be nice if I could test this with 2.0...

Best Regards: Tamas

On Sat, Feb 13, 2021 at 1:11 PM Tilman Hausherr <TH...@t-online.de>
wrote:

> Hi,
>
> There's nothing... That was a comment made in 2014 in PDFBOX-2524. I hit
> that place at a later time and made another comment.
> I had a look at it... two things would have to be done: 1) PDFont should
> allow retrieving that CMap, and 2) CMap should store an inverted mapping,
> e.g. filled in CMap.addCharMapping().
>
> So I tried this and ran the test code from the SO issue, and saved the
> file. And yes, "Protocol" appeared on the second page.
>
> Are you able to build from source? Do you want to test this with 2.0 or
> the trunk?
>
> Tilman

Re: Inquiry for an open PDFBox TODO in PDCIDFontType2.encode method

Posted by Tilman Hausherr <TH...@t-online.de>.

Hi,

There's nothing... That was a comment made in 2014 in PDFBOX-2524. I hit 
that place at a later time and made another comment.
I had a look at it... two things would have to be done: 1) PDFont should
allow retrieving that CMap, and 2) CMap should store an inverted mapping,
e.g. filled in CMap.addCharMapping().

So I tried this and ran the test code from the SO issue, and saved the 
file. And yes, "Protocol" appeared on the second page.
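
As a rough illustration of the second point, here is a self-contained sketch of a mapping that is filled in both directions by a single addCharMapping() call. The class InvertedCMapSketch and its internals are assumptions made for this example and are not the PDFBox CMap class. For simplicity it packs the code bytes into an int, so it does not distinguish codes of different lengths, and if two codes map to the same Unicode string the later one wins in the inverted direction.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical two-way code/Unicode mapping; not the PDFBox CMap class. */
class InvertedCMapSketch
{
    // forward direction: character code -> Unicode text (what /ToUnicode provides)
    private final Map<Integer, String> codeToUnicode = new HashMap<Integer, String>();
    // inverted direction: Unicode text -> character code bytes
    private final Map<String, byte[]> unicodeToByteCodes = new HashMap<String, byte[]>();

    void addCharMapping(byte[] codes, String unicode)
    {
        // fill the inverted map first, with a copy the caller cannot modify
        unicodeToByteCodes.put(unicode, codes.clone());
        codeToUnicode.put(toInt(codes), unicode);
    }

    /** Returns the Unicode text for a code, or null if there is none. */
    String toUnicode(byte[] codes)
    {
        return codeToUnicode.get(toInt(codes));
    }

    /** Returns the code bytes for a Unicode string, or null if there is none. */
    byte[] getCodesFromUnicode(String unicode)
    {
        byte[] codes = unicodeToByteCodes.get(unicode);
        return codes == null ? null : codes.clone();
    }

    // packs up to four bytes big-endian into an int (simplification, see above)
    private static int toInt(byte[] codes)
    {
        int value = 0;
        for (byte b : codes)
        {
            value = (value << 8) | (b & 0xff);
        }
        return value;
    }

    public static void main(String[] args)
    {
        InvertedCMapSketch cmap = new InvertedCMapSketch();
        cmap.addCharMapping(new byte[] { 0x00, 0x33 }, "P");
        System.out.println(cmap.toUnicode(new byte[] { 0x00, 0x33 }));   // P
        System.out.println(cmap.getCodesFromUnicode("P").length);        // 2
    }
}
```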

Are you able to build from source? Do you want to test this with 2.0 or 
the trunk?

Tilman
