Posted to dev@pdfbox.apache.org by "Christian Appl (Jira)" <ji...@apache.org> on 2020/03/05 09:48:00 UTC

[jira] [Comment Edited] (PDFBOX-4793) Questionable fallback font for some embedded chinese fonts

    [ https://issues.apache.org/jira/browse/PDFBOX-4793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051953#comment-17051953 ] 

Christian Appl edited comment on PDFBOX-4793 at 3/5/20, 9:47 AM:
-----------------------------------------------------------------

[~lehmi] Agreed, the embedded subset should contain all required glyphs - but iText (in the case of the problematic document) embedded a nonsensical subset. However, since other applications (such as PDF.js and Adobe Reader) display the PDF just fine, I cannot simply dismiss the document as erroneous.

It would be great if the solution were as easy as installing a new font... However, if everything is fine in your tests, then I seem to be doing something wrong here:

*Further information and following your instructions:*
I am using Windows 10 - I have reinstalled ArialUnicodeMS (even though my OS told me it was already installed).
 !screenshot-2.png! 

I created a fresh, empty project (to rule out hiccups in our code) and added PDFBox to the project's pom via the following dependency:
 !screenshot-3.png!

I used the PDFRenderer to check whether ArialUnicode would be used to render the image now:
 !screenshot-4.png! 

But Malgun is still used as the fallback font:
 !screenshot-5.png! 

This still leads to the following output:
 !screenshot-6.png!

*Questions:*
What on earth am I doing wrong?
Are the determined fallback fonts cached in some way, and if so, is deleting the cache an option?


> Questionable fallback font for some embedded chinese fonts
> ----------------------------------------------------------
>
>                 Key: PDFBOX-4793
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4793
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Rendering
>    Affects Versions: 2.0.18
>            Reporter: Christian Appl
>            Priority: Major
>         Attachments: image-2020-03-04-09-49-42-323.png, image-2020-03-04-09-58-01-055.png, image-2020-03-04-10-09-25-343.png, image-2020-03-04-10-31-03-065.png, pdf_font-zhcn.pdf, screenshot-2.png, screenshot-3.png, screenshot-4.png, screenshot-5.png, screenshot-6.png
>
>
> *Issue:*
> I tried to render PDFs that contain embedded Chinese fonts. Neither the PDF Debugger, nor printouts of the document (PDFPrintable), nor the PDFRenderer can display/render the Chinese glyphs correctly; all of them render placeholders instead.
> *Assumptions:*
> I assume that said embedded fonts are incomplete and don't contain all the glyphs required to render the text properly, and that PDFBox therefore attempts to use the previously determined fallback font. (!?)
>  !image-2020-03-04-09-49-42-323.png! 
>  !image-2020-03-04-09-58-01-055.png! 
> It then fails to find the glyphs in said fallback font.
> This is not surprising, as the fallback font "MalgunGothic-Semilight" (a Windows standard font) does not contain Chinese characters.
>  !image-2020-03-04-10-09-25-343.png! 
> *Debugging:*
> I tried to understand how the fallback font is determined and what could be done to solve this problem on my end, but I was unable to find a satisfying solution.
> My best guess so far is that the CIDFontMapping (FontMapperImpl) is to blame for determining an unfit fallback font.
> Although it seems to check whether the required code pages are contained in a fallback font, it still ranks the Malgun font as the top scorer and best substitute font, even though it clearly does not contain all required code pages.
> *My opinion:*
> This is troubling, as better-fitting fonts exist and could have been selected (e.g. Adobe Song Std). They are indeed included in the CIDFontMapping, but seemingly score lower for some reason.
> *Further information:*
> I cannot disclose the document in question; however, I found a document (pdf_font-zhcn.pdf) in another issue (PDFBOX-3132) that can be used to reproduce the issue (e.g. by dropping it into the PDF Debugger).
>  !image-2020-03-04-10-31-03-065.png! 
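
To make the "required code pages" check mentioned under *Debugging* more concrete, here is a minimal sketch of what such a test could look like. It is not PDFBox code and does not reproduce the scoring in FontMapperImpl; the class name, method name and example masks are hypothetical. The only facts it relies on are the CJK bit positions of the ulCodePageRange1 field in the OpenType OS/2 table and the Adobe CID orderings (GB1, CNS1, Japan1, Korea1).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Illustrative only, NOT PDFBox code. Class and method names are hypothetical.
 * Shows how a font's OS/2 code page bits could be tested against a CID ordering.
 */
class CodePageCheck {
    // CJK bit positions in ulCodePageRange1 of the OpenType OS/2 table.
    static final long JIS_JAPAN = 1L << 17;
    static final long CHINESE_SIMPLIFIED = 1L << 18;
    static final long KOREAN_WANSUNG = 1L << 19;
    static final long CHINESE_TRADITIONAL = 1L << 20;
    static final long KOREAN_JOHAB = 1L << 21;

    /** Returns true if the code page mask declares support for the given CID ordering. */
    static boolean coversOrdering(String ordering, long codePageRange) {
        switch (ordering) {
            case "GB1":    // Adobe-GB1: Simplified Chinese
                return (codePageRange & CHINESE_SIMPLIFIED) != 0;
            case "CNS1":   // Adobe-CNS1: Traditional Chinese
                return (codePageRange & CHINESE_TRADITIONAL) != 0;
            case "Japan1": // Adobe-Japan1
                return (codePageRange & JIS_JAPAN) != 0;
            case "Korea1": // Adobe-Korea1
                return (codePageRange & (KOREAN_WANSUNG | KOREAN_JOHAB)) != 0;
            default:       // unknown ordering: treat as not covered
                return false;
        }
    }

    public static void main(String[] args) {
        // Made-up masks for two hypothetical candidate fonts, not real font data.
        Map<String, Long> candidates = new LinkedHashMap<>();
        candidates.put("HypotheticalKoreanFont", KOREAN_WANSUNG | KOREAN_JOHAB);
        candidates.put("HypotheticalChineseFont", CHINESE_SIMPLIFIED);

        for (Map.Entry<String, Long> e : candidates.entrySet()) {
            System.out.println(e.getKey() + " covers GB1: "
                    + coversOrdering("GB1", e.getValue()));
        }
    }
}
```

Under this model, a font that only declares the Korean code pages would be rejected for a GB1 (Simplified Chinese) CID font before any ranking takes place. If Malgun declares only Korean code pages, a check of this kind would be expected to exclude it, which is why its top ranking is surprising.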


