Posted to users@pdfbox.apache.org by sunny hisa <su...@yahoo.com.INVALID> on 2017/05/12 15:00:45 UTC
Getting lots of warnings "No Unicode mapping for..." when extract text
When I use PDFBox to extract text, I get lots of warnings and the output is only garbage. But when I use Adobe Acrobat to export the attached PDF file to text, it works fine. I have attached the original PDF file, the text output and the log with warnings. In addition, the PDF file seems to have a Type-1 font embedded with a custom encoding.
The PDFBox version is pdfbox-app-2.0.5.
The command I use is: java -jar pdfbox-app-2.0.5.jar ExtractText FileWithIssue.pdf
I have checked many reports in the JIRA issue tracker but still found no way to solve it. I am looking forward to hearing from you.
Thanks & Best Regards
Sunny Xia
Re: Getting lots of warnings "No Unicode mapping for..." when extract text
Posted by Tilman Hausherr <TH...@t-online.de>.
On 12.05.2017 at 17:00, sunny hisa wrote:
> When I use PDFBox to extract text, I get lots of warnings and as
> output I only get garbage. But when I use Adobe Acrobat to export the
> attached PDF file to text, it works fine.
No, it doesn't work fine, here is what I get with Adobe Reader:
ATTENTION
!
"
!"&!" #"!""% !
"#" $"
> I have attached the original PDF file, the text output and the log
> with warnings. In addition, the
> PDF file seems to have a Type-1 font embedded with a custom encoding.
The PDF didn't get through to the list; you should have uploaded it to a
file-sharing host. I could access it only because I'm a moderator.
>
> The PDFBox version is pdfbox-app-2.0.5
> The command I use is: java -jar pdfbox-app-2.0.5.jar ExtractText
> FileWithIssue.pdf
>
> I have checked lots of reports on JIRA issue tracker, still find no
> way to solve it. I am looking forward to hearing from you.
See here: https://pdfbox.apache.org/2.0/faq.html#gibberish
The problem with your file is that it uses incorrect glyph names in the
/Differences table, such as "C0046" for "." or "C0065" for "A".
Changing that in the source code produces this output:
Preface
ATTENTION
Personnel, accessing Rack equipment described in
this document, should be familiar with and observe Safety
instructions.
The safety instructions and the meaning of the warning labels on
the equipment are given in 1.
This is still not complete: "APOLT" is missing (no idea why), and there are
NUL characters (which are in the PDF too).
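
For illustration only, here is a minimal Java sketch of the kind of fallback
described above. It assumes that every bad glyph name has the form "C" followed
by a four-digit decimal character code, which matches the two examples
("C0046" is 46, i.e. ".", and "C0065" is 65, i.e. "A") but may not hold for the
whole file. The class and method names are hypothetical, it uses no PDFBox
classes, and it is not the actual change made in the PDFBox source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: maps non-standard glyph names
// of the assumed form "Cdddd" (decimal character code) to a Unicode string.
final class GlyphNameFallback {

    // Assumption: "C" followed by exactly four decimal digits.
    private static final Pattern C_DECIMAL = Pattern.compile("C(\\d{4})");

    // Returns the character for names like "C0046", or null if the name
    // does not follow the assumed pattern.
    public static String toUnicode(String glyphName) {
        if (glyphName == null) {
            return null;
        }
        Matcher m = C_DECIMAL.matcher(glyphName);
        if (!m.matches()) {
            return null;
        }
        // Leading zeros are fine for a decimal parse: "0046" becomes 46.
        int code = Integer.parseInt(m.group(1));
        return new String(Character.toChars(code));
    }

    public static void main(String[] args) {
        System.out.println(toUnicode("C0046")); // prints "."
        System.out.println(toUnicode("C0065")); // prints "A"
        System.out.println(toUnicode("period")); // prints "null" (not the assumed pattern)
    }
}
```

A fallback like this would only be consulted when the normal glyph list lookup
finds no Unicode mapping for a name, so standard glyph names keep their usual
mapping.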
Tilman
>
>
> Thanks & Best Regards
> Sunny Xia