Posted to users@pdfbox.apache.org by Gino G <gi...@gmail.com> on 2024/01/30 14:52:01 UTC

Fwd: Help with Incorrect Identity-H Mapping

Hello there,

I'm encountering an error in how certain characters are encoded using
PDFBox. The issue exists in all versions of PDFBox, but I'm currently using
3.0.1.

contentStream.showText("äöüß");


The string "äöüß" is used as a test for Unicode characters that PDFBox
needs to render.

var resource = Processor.class.getResource("/OpenSans-Regular.ttf");
var file = Paths.get(resource.toURI()).toFile();
var targetStream = new FileInputStream(file);
var out = PDType0Font.load(PageAssembler.getDocument(), targetStream, false);
contentStream.setFont(out, 20);


To do so, I'm importing a font that I know has the glyphs for all four
special characters (OpenSans downloaded from Google Fonts).
However, this issue can be reproduced using any other Unicode-supported
font.

Executing the code, PDFBox renders the following character
sequence: Ã¤Ã¶Ã¼ÃŸ.
Clearly an encoding issue.

Using the PDF Debugger, it shows the text rendered as:

/F1 20 Tf
BT
  (\000\205\000f\000\205\000x\000\205\000~\000\205\0019) Tj
ET

Now, as far as I understand from what I've learned while debugging this
issue, \205 is the octal value that uses the glyph at position 133 (decimal
for \205) of the font with the id F1.
Again, looking at the F1 section in the PDF Debugger, the character listed
under the code / CID / GID 133 is indeed Ã, the first "incorrect" character
of the sequence, which is supposed to be "ä".
"ä", however, would be 166, not 133. How does PDFBox get this wrong?

As an aside, if I use showText and use toUnicode(166), PDFBox correctly
renders "ä" in the desired font!

Looking at the "ToUnicode" part of the F1 font, the following string is
displayed.

Could someone please help me figure out what is going on? And hopefully
even help me fix this issue? For more help, I have attached the PDF
document.

Best,
Gino

ToUnicode:

/CIDInit /ProcSet findresource begin
12 dict begin

begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def

/CMapName /Adobe-Identity-UCS def
/CMapType 2 def

1 begincodespacerange
<0000> <FFFF>
endcodespacerange

100 beginbfrange
<0001> <0001> <0000>
<0002> <0002> <000D>
<0003> <0061> <0020>
<0062> <00C1> <00A0>
<00C2> <00F2> <0100>
<00F3> <00FF> <0132>
<0100> <0122> <013F>
<0123> <0124> <021A>
<0125> <0140> <0164>
<0141> <0141> <0192>
<0142> <0147> <01FA>
<0148> <0149> <0218>
<014A> <014B> <02C6>
<014C> <014C> <02C9>
<014D> <0152> <02D8>
<0153> <0159> <0384>
<015A> <015A> <038C>
<015B> <016E> <038E>
<016F> <019A> <03A3>
<019B> <01A6> <0401>
<01A7> <01E8> <040E>
<01E9> <01F4> <0451>
<01F5> <01F6> <045E>
<01F7> <01F8> <0490>
<01F9> <01FE> <1E80>
<01FF> <01FF> <1EF2>
<0200> <0200> <1EF3>
<0201> <0203> <2013>
<0204> <020B> <2017>
<020C> <020E> <2020>
<020F> <020F> <2026>
<0210> <0210> <2030>
<0211> <0212> <2032>
<0213> <0214> <2039>
<0215> <0215> <203C>
<0216> <0216> <2044>
<0217> <0217> <207F>
<0218> <0219> <20A3>
<021A> <021A> <20A7>
<021B> <021B> <20AC>
<021C> <021C> <2105>
<021D> <021D> <2113>
<021E> <021E> <2116>
<021F> <021F> <2122>
<0220> <0220> <2126>
<0221> <0221> <212E>
<0222> <0225> <215B>
<0226> <0226> <2202>
<0227> <0227> <2206>
<0228> <0228> <220F>
<0229> <022A> <2211>
<022B> <022B> <221A>
<022C> <022C> <221E>
<022D> <022D> <222B>
<022E> <022E> <2248>
<022F> <022F> <2260>
<0230> <0231> <2264>
<0232> <0232> <25CA>
<0235> <0235> <0326>
<0237> <0238> <2074>
<0239> <023A> <2077>
<023B> <0246> <2000>
<0247> <0247> <FEFF>
<0248> <0249> <FFFC>
<024A> <024A> <01F0>
<024B> <024B> <02BC>
<024C> <024D> <03D1>
<024E> <024E> <03D6>
<024F> <0250> <1E3E>
<0251> <0252> <1E00>
<0253> <0253> <02F3>
<0254> <0255> <01A0>
<0256> <0257> <01AF>
<0259> <0259> <0400>
<025A> <025A> <040D>
<025B> <025B> <0450>
<025C> <025C> <045D>
<025D> <027F> <0460>
<0280> <0287> <0488>
<0288> <02F5> <0492>
<02F6> <02FF> <0500>
<0300> <0309> <050A>
<030A> <035B> <1EA0>
<035C> <0361> <1EF4>
<0362> <0362> <20AB>
<036D> <036E> <0162>
<036F> <0372> <01EA>
<0373> <0373> <0259>
<0374> <0374> <0309>
<0375> <0375> <1F4D>
<0376> <0376> <1FDE>
<0377> <0377> <2070>
<0378> <0378> <2076>
<0379> <0379> <2079>
<038A> <038E> <FB00>
<038F> <038F> <1E9E>
<0390> <0391> <A7B3>
<03AF> <03AF> <0131>
<03B0> <03B0> <0237>
<03B1> <03B1> <A7B5>
endbfrange

35 beginbfrange
<03B2> <03B2> <AB53>
<03C1> <03C8> <2095>
<03C9> <03E3> <05D0>
<03E4> <03F0> <FB2A>
<03F1> <03F5> <FB38>
<03F6> <03F6> <FB3E>
<03F7> <03F8> <FB40>
<03F9> <03FA> <FB43>
<03FB> <03FF> <FB46>
<0400> <0400> <FB4B>
<0401> <0405> <0300>
<0406> <0408> <0306>
<0409> <040B> <030A>
<040C> <040C> <030F>
<040D> <040D> <0312>
<040E> <040E> <0323>
<040F> <0410> <0327>
<0411> <0412> <0485>
<0413> <0414> <0483>
<0415> <0422> <05B0>
<0423> <0424> <05C1>
<0425> <0425> <05C7>
<0459> <0462> <2080>
<0463> <0463> <05BE>
<0464> <0464> <207D>
<0465> <0465> <208D>
<0466> <0466> <207E>
<0467> <0467> <208E>
<0468> <0468> <207A>
<0469> <0469> <207C>
<046A> <046A> <208A>
<046B> <046B> <208C>
<046C> <046C> <2215>
<046D> <046D> <20AA>
<046E> <046E> <2120>
endbfrange

endcmap
CMapName currentdict /CMap defineresource pop
end
end
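
The bfrange entries above can be applied by hand to the codes from the
content stream. The following is only a minimal sketch, not PDFBox code:
BfRangeLookup and its methods are hypothetical names, and just the two
ranges needed here are copied from the CMap. Under those assumptions, code
0x0085 (133) maps to U+00C3 (Ã) and code 0x00A6 (166) maps to U+00E4 (ä),
which suggests the font mapping is consistent and the wrong characters were
already in the string passed to showText.

```java
class BfRangeLookup {
    // Each row is one "<lo> <hi> <dst>" line of a beginbfrange section.
    // Only the two ranges needed for this example are copied from the CMap.
    static final int[][] RANGES = {
        {0x0062, 0x00C1, 0x00A0},
        {0x0125, 0x0140, 0x0164},
    };

    // Returns the Unicode code point for a character code,
    // or -1 if none of the listed ranges covers it.
    static int toUnicode(int code) {
        for (int[] r : RANGES) {
            if (code >= r[0] && code <= r[1]) {
                return r[2] + (code - r[0]);
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Two-byte codes taken from the failing content stream.
        int[] broken = {0x0085, 0x0066, 0x0085, 0x0078, 0x0085, 0x007E, 0x0085, 0x0139};
        for (int code : broken) {
            System.out.printf("%04X -> U+%04X%n", code, toUnicode(code));
        }
        // The code that was expected for a-umlaut.
        System.out.printf("%04X -> U+%04X%n", 0x00A6, toUnicode(0x00A6));
    }
}
```

The eight codes come out as U+00C3 U+00A4 U+00C3 U+00B6 U+00C3 U+00BC
U+00C3 U+0178, i.e. two characters for each intended one.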

-- 
*Gino*

Re: Fwd: Help with Incorrect Identity-H Mapping

Posted by Tilman Hausherr <TH...@t-online.de>.
Also try changing the line

    cs.showText("äöüß");

to

    String s = "äöüß";
    System.out.println(s.length());
    cs.showText(s);

The output on the console should be 4. I suspect your output will be 8
if my theory is correct.
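
What that theory predicts can be reproduced without touching any compiler
settings. This is a minimal sketch that assumes the source file is saved as
UTF-8 but read as windows-1252 (an assumption; the actual default depends on
the platform and JDK). MojibakeDemo and misread are hypothetical names, and
Unicode escapes are used so the example itself does not depend on the
source encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Re-decodes the UTF-8 bytes of s as windows-1252, which is what happens
    // to a string literal when a UTF-8 source file is read with that encoding.
    static String misread(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // a-umlaut, o-umlaut, u-umlaut, sharp s
        String intended = "\u00e4\u00f6\u00fc\u00df";
        String seen = misread(intended);

        System.out.println(intended.length()); // 4
        System.out.println(seen.length());     // 8

        // Print the code points of the misread string.
        for (char c : seen.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

The misread string consists of U+00C3 U+00A4 U+00C3 U+00B6 U+00C3 U+00BC
U+00C3 U+0178, which is the sequence Ã¤Ã¶Ã¼ÃŸ.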

Tilman


Re: Fwd: Help with Incorrect Identity-H Mapping

Posted by Tilman Hausherr <TH...@t-online.de>.
Hello Gino,

Please tell me whether it happens with every font or only with that one.
And check whether the encoding of the source file is the same as the one
passed to the javac compiler. I suspect your file is UTF-8 but the Java
compiler expects a single-byte encoding.
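
One way to check this without guessing is to print the code points that the
compiler actually stored for the literal. A minimal sketch follows;
CodePointDump and dump are hypothetical names. If the source encoding is
misread, a literal typed directly into the file will show more than four
code points. Writing the literal with Unicode escapes, or passing
"-encoding UTF-8" to javac, removes the dependency on the platform default.

```java
class CodePointDump {
    // Formats each code point of s as U+XXXX, separated by single spaces.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes are decoded the same way whatever the file encoding is.
        String escaped = "\u00e4\u00f6\u00fc\u00df";
        System.out.println(dump(escaped)); // U+00E4 U+00F6 U+00FC U+00DF

        // To test a source file, add a second line that dumps the literal
        // typed directly into that file and compare the two outputs.
    }
}
```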

It works for me, I just tested it:

     public static void main(String[] args) throws IOException
     {
         try (PDDocument doc = new PDDocument())
         {
             PDFont font = PDType0Font.load(doc, new FileInputStream("XXXX/OpenSans-Regular.ttf"), false);
             PDPage page = new PDPage();
             doc.addPage(page);
             try (PDPageContentStream cs = new PDPageContentStream(doc, page))
             {
                 cs.setFont(font, 20);
                 cs.beginText();
                 cs.newLineAtOffset(50, 650);
                 cs.showText("äöüß");
                 cs.endText();
             }
             doc.save("XXXX/gino.pdf");
         }
     }

And this is the content stream:

/F1 20 Tf
BT
   50 650 Td
   (\000\246\000\270\000\276\000\241) Tj
ET
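
For comparison, this Tj string and the one from the original post can be
turned back into two-byte codes. The sketch below is a simplified reader,
not PDFBox's parser: it handles only octal escapes and plain characters (no
\n, \( or similar), and TjCodes is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

class TjCodes {
    // Converts the body of a PDF literal string to byte values, handling only
    // a backslash followed by 1 to 3 octal digits and unescaped characters.
    static List<Integer> bytesOf(String literal) {
        List<Integer> bytes = new ArrayList<>();
        int i = 0;
        while (i < literal.length()) {
            char c = literal.charAt(i);
            if (c == '\\') {
                int value = 0;
                int digits = 0;
                i++;
                while (i < literal.length() && digits < 3
                        && literal.charAt(i) >= '0' && literal.charAt(i) <= '7') {
                    value = value * 8 + (literal.charAt(i) - '0');
                    i++;
                    digits++;
                }
                bytes.add(value & 0xFF);
            } else {
                bytes.add((int) c);
                i++;
            }
        }
        return bytes;
    }

    // With Identity-H every code is two bytes, high byte first.
    static List<Integer> codesOf(String literal) {
        List<Integer> b = bytesOf(literal);
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i + 1 < b.size(); i += 2) {
            codes.add((b.get(i) << 8) | b.get(i + 1));
        }
        return codes;
    }

    public static void main(String[] args) {
        // Working output from this message.
        System.out.println(codesOf("\\000\\246\\000\\270\\000\\276\\000\\241"));
        // Failing output from the original post.
        System.out.println(codesOf(
                "\\000\\205\\000f\\000\\205\\000x\\000\\205\\000~\\000\\205\\0019"));
    }
}
```

The working string gives [166, 184, 190, 161], four codes for four
characters. The failing one gives [133, 102, 133, 120, 133, 126, 133, 313],
eight codes, so showText received eight characters.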

Tilman
