Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2018/06/25 06:24:00 UTC
[jira] [Comment Edited] (PDFBOX-4250) PDF File with embedded fonts:
text extraction fails or returns junk characters
[ https://issues.apache.org/jira/browse/PDFBOX-4250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16521611#comment-16521611 ]
Tilman Hausherr edited comment on PDFBOX-4250 at 6/25/18 6:23 AM:
------------------------------------------------------------------
Weird… Maybe I deleted it accidentally; this is the kind of question that I answer. I'll answer something tomorrow. In the meantime, read this:
[https://pdfbox.apache.org/2.0/faq.html#text-extraction]
I also had a quick look... the fonts in your file don't have a ToUnicode stream. (Have a look at your file with PDFDebugger and look at the fonts)
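To illustrate why a missing ToUnicode stream matters, here is a minimal, self-contained sketch in plain Java. It does not use the PDFBox API; the class name ToUnicodeSketch and the method extract are hypothetical, and a Map stands in for the font's ToUnicode data. It also assumes that an extractor with no mapping falls back to treating the raw character code as a Unicode code point, which would match the "binary" output described in the report.

```java
import java.util.HashMap;
import java.util.Map;

public class ToUnicodeSketch {

    // Hypothetical model of text extraction for a simple font.
    // toUnicode stands in for the font's ToUnicode data (character code -> text).
    // Assumption: when no mapping exists, the raw code is coerced into a char.
    public static String extract(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            String unicode = toUnicode.get(code);
            if (unicode == null) {
                // No mapping: fall back to the raw code, e.g. 1 -> U+0001
                unicode = String.valueOf((char) code);
            }
            sb.append(unicode);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A subset font typically numbers its glyphs 1, 2, 3, ... in order of use
        int[] codes = {1, 2, 3, 3, 4};

        Map<Integer, String> withToUnicode = new HashMap<>();
        withToUnicode.put(1, "H");
        withToUnicode.put(2, "e");
        withToUnicode.put(3, "l");
        withToUnicode.put(4, "o");

        Map<Integer, String> withoutToUnicode = new HashMap<>();

        // With a mapping, the codes resolve to readable text
        System.out.println(extract(codes, withToUnicode));

        // Without one, the result contains only control characters;
        // print the code points so they are visible
        extract(codes, withoutToUnicode).chars()
                .forEach(c -> System.out.print("U+" + String.format("%04X", c) + " "));
        System.out.println();
    }
}
```

In this model the glyphs still render correctly in a viewer, because drawing only needs the glyph for each code; it is the reverse step, from code to character, that has nothing to work with.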
was (Author: tilman):
Weird… Maybe I deleted it accidentally, this is the kind of questions where I answer. I'll answer something tomorrow, in the meantime, read this:
[https://pdfbox.apache.org/2.0/faq.html#text-extraction]
I also had a quick look... the fonts in your file doesn't have a ToUnicode stream. (Have a look at your file with PDFDebugger and look at the fonts)
> PDF File with embedded fonts: text extraction fails or returns junk characters
> ------------------------------------------------------------------------------
>
> Key: PDFBOX-4250
> URL: https://issues.apache.org/jira/browse/PDFBOX-4250
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.9
> Reporter: Bob Swanson
> Priority: Major
>
> One of the people that I support created a PDF file from a LibreOffice document, and then misplaced the original document. I believed that I could use PDFBox to extract the text from the PDF, and at least provide that information to the user.
>
> When I ran the text extractor from the "app" jar on their PDF file, I got many messages of the following kind:
>
> ...
> Jun 13, 2018 5:38:43 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
> WARNING: No Unicode mapping for 7 (7) in font EXIRGE+Ubuntu
> Jun 13, 2018 5:38:43 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
> WARNING: No Unicode mapping for 8 (8) in font EXIRGE+Ubuntu
> Jun 13, 2018 5:38:43 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
> WARNING: No Unicode mapping for 1 (1) in font JTPICY+AndaleMono
> Jun 13, 2018 5:38:43 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
> ...
>
> The resulting "txt" file is just binary numbers, unless the font is one of the "standard" ones. I ran the debugger on the PDF file and saw that several fonts were embedded, and thus used low numbers for encoding (1, 2, 3, etc.).
>
> When viewed, the PDF file looks good, but nothing can be copied or pasted from the display (again, standard font seems OK).
>
> The original file was of a sensitive nature, so I re-created the problem with a simpler file.
>
> Running on Ubuntu 16.04
> LibreOffice was used to "print" on the cups-pdf "printer" (which may be part of the problem).
>
> Text extraction was attempted with pdfbox-app-2.0.9.jar
>
> PDF file is at:
>
> http://swansongrp.com/misc/mytest3.pdf
>