Posted to dev@pdfbox.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2021/11/20 12:07:00 UTC

[jira] [Created] (PDFBOX-5328) Failing to get multiple encodings from cmap table

Tilman Hausherr created PDFBOX-5328:
---------------------------------------

             Summary: Failing to get multiple encodings from cmap table
                 Key: PDFBOX-5328
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5328
             Project: PDFBox
          Issue Type: Bug
          Components: FontBox
    Affects Versions: 2.0.24, 1.8.16
            Reporter: Tilman Hausherr
            Assignee: Tilman Hausherr
             Fix For: 1.8.17, 2.0.25, 3.0.0 PDFBox
         Attachments: NotoSansSC-Regular.otf

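As reported by Ty Lewis on the users mailing list (see [here|https://mail-archives.apache.org/mod_mbox/pdfbox-users/202111.mbox/%3CCAPRgSAOG1a9kw4wSmArH0uG-N5xd9_kPq7ju4U%3DSv9H9CQZmcQ%40mail.gmail.com%3E]), the Unicode cmap lookup returns only one code point for a glyph although one of the cmap subtables maps two code points to it: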
{noformat}
Unicode encodings for GID 8712: List(U+f967)
Unicode encodings for GID 8712 from table (platformId = 0 encodingId = 3):
List(U+4e0d, U+f967)
Unicode encodings for GID 8712 from table (platformId = 0 encodingId = 4):
List(U+f967)
{noformat}
I wrote some Java code to reproduce this:
{code}
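import java.io.File;
import java.util.List;
import org.apache.fontbox.ttf.CmapLookup;
import org.apache.fontbox.ttf.CmapSubtable;
import org.apache.fontbox.ttf.CmapTable;
import org.apache.fontbox.ttf.OTFParser;
import org.apache.fontbox.ttf.OpenTypeFont;

// Parse the attached font
File fontFile = new File("NotoSansSC-Regular.otf");
OTFParser otfParser = new OTFParser(false);
OpenTypeFont otf = otfParser.parse(fontFile);

// Code points for GID 8712 via the combined Unicode lookup
CmapLookup unicodeCmapLookup = otf.getUnicodeCmapLookup();
List<Integer> charCodes = unicodeCmapLookup.getCharCodes(8712);
System.out.println(charCodes);

// Fetch the two Unicode subtables: BMP (encodingId = 3) and full (encodingId = 4)
CmapTable cmapTable = otf.getCmap();
CmapSubtable unicodeBmpCmapTable = cmapTable.getSubtable(CmapTable.PLATFORM_UNICODE, CmapTable.ENCODING_UNICODE_2_0_BMP);
CmapSubtable unicodeFullCmapTable = cmapTable.getSubtable(CmapTable.PLATFORM_UNICODE, CmapTable.ENCODING_UNICODE_2_0_FULL);

// Code points for GID 8712 from each subtable
List<Integer> unicodeBmpCharCodes = unicodeBmpCmapTable.getCharCodes(8712);
List<Integer> unicodeFullCharCodes = unicodeFullCmapTable.getCharCodes(8712);

System.out.println(unicodeBmpCharCodes);
System.out.println(unicodeFullCharCodes);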
{code}
A look at the tables with DTL OTMaster 3.7 Light shows that there are indeed two entries for this glyph. They are 不 (U+4E0D) and 不 (U+F967).
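
To illustrate how a glyph can end up with fewer code points than the table contains, here is a minimal, self-contained sketch. It is not FontBox code and makes no claim about the actual cause of this bug; the class and method names (GlyphToCodeInversion, invertSingle, invertMulti, sampleCmap) are hypothetical. It only assumes that a cmap is a code point to GID mapping that gets inverted, and shows that an inversion keeping one value per GID drops entries, while one that collects a list keeps both.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; this is not how FontBox is implemented.
public class GlyphToCodeInversion {

    // Lossy inversion: one code point per GID, later entries overwrite earlier ones.
    static Map<Integer, Integer> invertSingle(Map<Integer, Integer> codeToGid) {
        Map<Integer, Integer> gidToCode = new LinkedHashMap<>();
        for (Map.Entry<Integer, Integer> e : codeToGid.entrySet()) {
            gidToCode.put(e.getValue(), e.getKey());
        }
        return gidToCode;
    }

    // Lossless inversion: every code point that maps to a GID is collected.
    static Map<Integer, List<Integer>> invertMulti(Map<Integer, Integer> codeToGid) {
        Map<Integer, List<Integer>> gidToCodes = new LinkedHashMap<>();
        for (Map.Entry<Integer, Integer> e : codeToGid.entrySet()) {
            gidToCodes.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return gidToCodes;
    }

    // Sample data taken from the report: both code points map to GID 8712.
    static Map<Integer, Integer> sampleCmap() {
        Map<Integer, Integer> codeToGid = new TreeMap<>(); // iterates in ascending code point order
        codeToGid.put(0x4E0D, 8712);
        codeToGid.put(0xF967, 8712);
        return codeToGid;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> cmap = sampleCmap();
        System.out.println(String.format("single: U+%04X", invertSingle(cmap).get(8712)));
        StringBuilder sb = new StringBuilder("multi:");
        for (int code : invertMulti(cmap).get(8712)) {
            sb.append(String.format(" U+%04X", code));
        }
        System.out.println(sb);
    }
}
```

With the sample data, the lossy variant keeps only the entry it sees last for GID 8712, while the list-based variant returns both code points.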


