Posted to dev@pdfbox.apache.org by "Oleksii Zinkovskyi (JIRA)" <ji...@apache.org> on 2017/12/15 20:59:00 UTC

[jira] [Resolved] (PDFBOX-4036) Invalid ToUnicode CMap in font

     [ https://issues.apache.org/jira/browse/PDFBOX-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oleksii Zinkovskyi resolved PDFBOX-4036.
----------------------------------------
    Resolution: Not A Bug

Thank you for the quick response. A faulty file was my first suspicion as well, but since working with this particular file is a major business requirement in my current project, I wanted to be completely sure that the issue was the file itself and not something I was doing or the tools I was using. I believe you've addressed all of my concerns, thank you.

> Invalid ToUnicode CMap in font
> ------------------------------
>
>                 Key: PDFBOX-4036
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4036
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.4, 2.0.8
>         Environment: Windows 10 64 bit, STS 3.9.1, JDK 1.8.0_152, Gradle
>            Reporter: Oleksii Zinkovskyi
>         Attachments: CSTA17.pdf, PDFBOX-4036-reduced.pdf
>
>
> While calling textStripper.getText(document) on the attached PDF file to extract text and save it to .txt, I receive the following warnings:
> {quote}Dec 15, 2017 8:53:22 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font UYQXWX+MaterialIcons-Regular
> Dec 15, 2017 8:53:22 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+380 (380) in font UYQXWX+MaterialIcons-Regular
> Dec 15, 2017 8:53:22 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+381 (381) in font UYQXWX+MaterialIcons-Regular
> Dec 15, 2017 8:53:25 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font FANHRS+MaterialIcons-Regular
> Dec 15, 2017 8:53:25 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+380 (380) in font FANHRS+MaterialIcons-Regular
> Dec 15, 2017 8:53:25 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+381 (381) in font FANHRS+MaterialIcons-Regular{quote}
> In the end the file is generated and properly saved, but some letters are missing (like "ft" in "software" or "ff" in "different"). So far I've tested close to 10 files and this is the only problematic one I've found. Depending on which program I use to view the resulting .txt file, I get either blank spaces (Notepad) or "NUL" values (Notepad++) in place of the missing letters. What's more, some editors (Sublime Text Editor) outright refuse to open the file and treat it as unreadable/corrupted byte code. Suffice it to say, working with such a file is somewhat difficult...
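
Since the issue is resolved as a problem with the file itself, one possible workaround is to post-process the extracted string before saving it. The sketch below is only an illustration and is not part of PDFBox: it assumes the "NUL" values shown by Notepad++ are U+0000 characters in the string returned by getText, and the class name ExtractedTextCleaner and the method replaceControlChars are hypothetical. It swaps control characters (other than newline, carriage return and tab) for a visible placeholder so that text editors can open the result. It cannot recover the missing "ft"/"ff" ligatures, because that mapping is absent from the font's ToUnicode CMap.

```java
public final class ExtractedTextCleaner {

    private ExtractedTextCleaner() {
    }

    // Hypothetical helper: replaces control characters such as U+0000 with a
    // placeholder, keeping line breaks and tabs untouched.
    static String replaceControlChars(String text, String placeholder) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            boolean keep = cp == '\n' || cp == '\r' || cp == '\t'
                    || !Character.isISOControl(cp);
            if (keep) {
                out.appendCodePoint(cp);
            } else {
                out.append(placeholder);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed shape of the extracted text: NUL where the ligature was.
        String extracted = "so\0ware is di\0erent";
        System.out.println(replaceControlChars(extracted, "?"));
    }
}
```

In practice the argument would be the string returned by textStripper.getText(document). A visible placeholder is used rather than an empty string so that the gaps stay easy to find in the saved .txt file.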


