Posted to dev@pdfbox.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2019/09/10 17:21:00 UTC

[jira] [Commented] (PDFBOX-4647) pdf内嵌字体解析不出来 ABCDEE+Arial 字体 (embedded PDF font cannot be parsed: font ABCDEE+Arial)

    [ https://issues.apache.org/jira/browse/PDFBOX-4647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926791#comment-16926791 ] 

Tilman Hausherr commented on PDFBOX-4647:
-----------------------------------------

The Chinese title translates to "Inline font parsing does not come out".

You're missing the text because the "ToUnicode" mapping is missing in that font. Try with Adobe Reader: you will not be able to extract it there either. (It is the part with "Boulevard Miguel de Cervantes".) The only solution is OCR, e.g. with Apache Tika and Tesseract.

See also

[https://pdfbox.apache.org/2.0/faq.html#text-extraction]
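
To illustrate why a missing "ToUnicode" mapping leaves nothing to extract, here is a minimal sketch in plain Java. It is a toy model and not PDFBox's API: the class ToUnicodeSketch, the method extract, and the sample code-to-character tables are all hypothetical. The only detail taken from the report is the code 24, which appears in the "No Unicode mapping for CID+24 (24)" message. A viewer can still draw the glyphs for these codes, but text extraction needs a table from code to Unicode, and without it each code yields no text.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: NOT the PDFBox API. All names here are hypothetical.
class ToUnicodeSketch {

    // Holds the recovered text plus the codes that had no Unicode mapping.
    static final class Result {
        final String text;
        final List<Integer> unmapped;

        Result(String text, List<Integer> unmapped) {
            this.text = text;
            this.unmapped = unmapped;
        }
    }

    // Decodes character codes through a ToUnicode-style table.
    // A code that is absent from the table contributes no text at all.
    static Result extract(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder text = new StringBuilder();
        List<Integer> unmapped = new ArrayList<>();
        for (int code : codes) {
            String unicode = toUnicode.get(code);
            if (unicode == null) {
                unmapped.add(code);
            } else {
                text.append(unicode);
            }
        }
        return new Result(text.toString(), unmapped);
    }

    public static void main(String[] args) {
        // Arbitrary codes as they might appear in a subset font (assumed values).
        int[] codes = {24, 37, 50};

        // Assumed table for a font that does carry a mapping.
        Map<Integer, String> withMapping = new HashMap<>();
        withMapping.put(24, "B");
        withMapping.put(37, "o");
        withMapping.put(50, "u");

        // A font without any mapping is modelled as an empty table.
        Map<Integer, String> withoutMapping = new HashMap<>();

        Result ok = extract(codes, withMapping);
        Result bad = extract(codes, withoutMapping);
        System.out.println("with mapping: text=" + ok.text + " unmapped=" + ok.unmapped);
        System.out.println("without mapping: text=" + bad.text + " unmapped=" + bad.unmapped);
    }
}
```

In the second case every code ends up in the unmapped list and the text is empty, which matches the situation described above: the page renders, but only OCR on the rendered image can recover the characters.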

 

> pdf内嵌字体解析不出来  ABCDEE+Arial 字体 (embedded PDF font cannot be parsed: font ABCDEE+Arial)
> -----------------------------
>
>                 Key: PDFBOX-4647
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4647
>             Project: PDFBox
>          Issue Type: Bug
>          Components: FontBox, PDModel
>    Affects Versions: 2.0.4
>            Reporter: wanling
>            Priority: Major
>         Attachments: 5e214f828f164322a6600f183191dda5.pdf
>
>
> The errors are as follows:
> OpenType Layout tables used in font ABCDEE+Arial are not implemented in PDFBox and will be ignored;
> No Unicode mapping for CID+24 (24) in font ABCDEE+Arial
> Adobe can display it normally
>  


