Posted to dev@pdfbox.apache.org by "chunlinyao (JIRA)" <ji...@apache.org> on 2019/06/13 22:18:00 UTC

[jira] [Comment Edited] (PDFBOX-4570) U+2225 rendered as U+2016 glyph when use UniJIS-UCS2-H and non embedded font

    [ https://issues.apache.org/jira/browse/PDFBOX-4570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863504#comment-16863504 ] 

chunlinyao edited comment on PDFBOX-4570 at 6/13/19 10:17 PM:
--------------------------------------------------------------

Yes, the MS PMincho glyph changed to diagonal lines starting with Windows Vista. The old (CP932) mapping sends both U+2016 and U+2225 to the same code, J+8161, so the old MS PMincho contains only one glyph, for U+2225. Adobe's UniJIS-UCS2-H likewise maps both U+2016 and U+2225 to CID 666.

JIS X 0213 maps U+2016 to J+8161 and U+2225 to J+81d2, and MS PMincho now contains two glyphs. Adobe's UniJIS-UTF16-H was also changed to map U+2016 to CID 666 and U+2225 to CID 15489.

{code:bash}
$ echo "0000 2016 2225" | xxd -r |iconv -f utf-16be -t cp932 |xxd
00000000: 8161 8161                                .a.a
$ echo "0000 2016 2225" | xxd -r |iconv -f utf-16be -t shift_jisx0213 |xxd
00000000: 8161 81d2                                .a..
{code}
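
The same collision can be seen on the CMap side. The sketch below is only an illustration: `cid_for` is a hypothetical helper, and its table is a hand-written excerpt holding just the four mappings quoted in this comment, not the real Adobe CMap files.

```shell
# Hypothetical helper: look up a CID in a hand-written excerpt of a CMap.
# Only the four mappings mentioned in this comment are included.
cid_for() {
  case "$1:$2" in
    UniJIS-UCS2-H:2016)  echo 666 ;;
    UniJIS-UCS2-H:2225)  echo 666 ;;    # same CID as U+2016
    UniJIS-UTF16-H:2016) echo 666 ;;
    UniJIS-UTF16-H:2225) echo 15489 ;;  # distinct CID
    *) echo "unmapped" >&2; return 1 ;;
  esac
}

# Print the CID each CMap assigns to the two code points.
for cmap in UniJIS-UCS2-H UniJIS-UTF16-H; do
  for cp in 2016 2225; do
    echo "$cmap U+$cp -> CID $(cid_for "$cmap" "$cp")"
  done
done
```

With UniJIS-UCS2-H both code points end up on CID 666, so the distinction is already lost before any font lookup happens.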

If users really need U+2225, we should suggest that they switch to UniJIS-UTF16-H or embed the fonts.
It seems Adobe Reader bypasses the CMap: perhaps, when it knows the source encoding is Unicode and the font should be looked up by Unicode glyph name, it uses the code from the document to look up the glyph directly.
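
The two lookup strategies can be contrasted with a small sketch. This only illustrates the guess above; it is not Adobe Reader's or PDFBox's actual code. All function names are hypothetical, and the reverse table (CID 666 back to U+2016) is an assumption chosen to match the glyph observed in the report.

```shell
# Excerpt of UniJIS-UCS2-H as described in this thread:
# both code points share CID 666.
ucs2_to_cid() {
  case "$1" in
    2016|2225) echo 666 ;;
    *) return 1 ;;
  esac
}

# Assumed reverse table (CID -> Unicode); 666 -> U+2016 matches
# the glyph that is actually rendered.
cid_to_unicode() {
  case "$1" in
    666) echo 2016 ;;
    *) return 1 ;;
  esac
}

# Strategy A: Unicode -> CID -> Unicode, then pick the glyph by the result.
glyph_via_cid() {
  cid=$(ucs2_to_cid "$1") || return 1
  cid_to_unicode "$cid"
}

# Strategy B (the guess about Adobe Reader): use the document's code directly.
glyph_direct() {
  echo "$1"
}

echo "via CID: U+$(glyph_via_cid 2225)"
echo "direct:  U+$(glyph_direct 2225)"
```

Strategy A cannot recover U+2225 once it has passed through CID 666, while strategy B never consults the CMap for glyph selection.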


> U+2225 rendered as U+2016 glyph when use UniJIS-UCS2-H and non embedded font
> ----------------------------------------------------------------------------
>
>                 Key: PDFBOX-4570
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4570
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: FontBox
>    Affects Versions: 2.0.15
>         Environment: Windows 10 64bit, Adobe Reader 2019.012.20034
>            Reporter: chunlinyao
>            Priority: Minor
>         Attachments: correct.png, incorrect.png, u2225.pdf, u2225.png
>
>
> Maybe this is not a bug in PDFBox, but this PDF renders differently than in Adobe Reader. It uses the MS PMincho font, which has a glyph for U+2225; the glyph in Win10 differs from the one in WinXP (I confirmed that using FontForge).
> Adobe Reader 2019.012.20034 on Win10 renders it correctly. However, even Adobe Reader 2019.012.20034 on macOS renders it incorrectly (with the MS PMincho font installed).
> MuPDF 1.6 on Windows, Chrome, and Firefox all render it like PDFBox.
> Although Adobe Reader on Win10 renders it correctly, when you copy the text from the PDF you get U+2016, not U+2225.
> I suspect that Adobe Reader does not use UniJIS-UCS2-H to convert Unicode to a CID and then convert back to Unicode when retrieving glyphs.
> UniJIS-UCS2-H is obsolete. It maps both U+2225 and U+2016 to CID+666; changing to UniJIS-UTF16-H works around this problem.
> Is there some possibility of improving PDFBox to render like Adobe Reader?


