Posted to users@pdfbox.apache.org by sunny hisa <su...@yahoo.com.INVALID> on 2017/05/12 15:00:45 UTC

Getting lots of warnings "No Unicode mapping for..." when extracting text

When I use PDFBox to extract text, I get lots of warnings, and the output is only garbage. But when I use Adobe Acrobat to export the attached PDF file to text, it works fine. I have attached the original PDF file, the text output, and the log with the warnings. Also, the PDF file seems to have an embedded Type 1 font with a custom encoding.
The PDFBox version is pdfbox-app-2.0.5.
The command I use is: java -jar pdfbox-app-2.0.5.jar ExtractText FileWithIssue.pdf
I have checked lots of reports in the JIRA issue tracker but still found no way to solve it. I am looking forward to hearing from you.

Thanks & Best Regards
Sunny Xia


Re: Getting lots of warnings "No Unicode mapping for..." when extracting text

Posted by Tilman Hausherr <TH...@t-online.de>.
Am 12.05.2017 um 17:00 schrieb sunny hisa:
> When I use PDFBox to extract text, I get lots of warnings, and the
> output is only garbage. But when I use Adobe Acrobat to export the
> attached PDF file to text, it works fine.

No, it doesn't work fine. Here is what I get with Adobe Reader:


  
ATTENTION
􀀀 􀀀􀀀    􀀀􀀀
􀀀
􀀀􀀀􀀀
􀀀􀀀!􀀀􀀀
 􀀀
"

􀀀!"&􀀀!" #"!􀀀􀀀"􀀀􀀀􀀀"􀀀% 􀀀!􀀀
"􀀀#"􀀀 􀀀$􀀀􀀀" 􀀀





> I have attached the original PDF file, the text output, and the log
> with the warnings. Also, the PDF file seems to have an embedded
> Type 1 font with a custom encoding.

The PDF didn't get through to the list. You should have uploaded it to a
file-sharing site. I could access it only because I'm a moderator.


>
> The PDFBox version is pdfbox-app-2.0.5.
> The command I use is: java -jar pdfbox-app-2.0.5.jar ExtractText
> FileWithIssue.pdf
>
> I have checked lots of reports in the JIRA issue tracker but still
> found no way to solve it. I am looking forward to hearing from you.

See here:  https://pdfbox.apache.org/2.0/faq.html#gibberish

The problem with your file is that it uses incorrect glyph names in the 
/Differences table, like "C0046" for a ".", or "C0065" for an "A".

Changing that in the source code brings this output:


Preface
ATTENTION
Personnel,  accessing    Rack  equipment  described  in
this  document,  should  be  familiar  with  and  observe  Safety
instructions.
The  safety  instructions  and  the  meaning  of  the  warning labels  on
the  equipment  are  given  in    1.


This is still not complete: "APOLT" is missing (no idea why), and there
are NUL characters (which are in the PDF too).
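
The two glyph names quoted above look like a "C" followed by a decimal
character code (46 is ".", 65 is "A"). Below is a minimal sketch of that
idea in plain Java, assuming the same pattern holds for every name in the
/Differences table of this file. The class GlyphNameFix and the method
toUnicode are hypothetical names; this is not PDFBox API and not
necessarily the change that was made in the source code.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of PDFBox.
final class GlyphNameFix {
    // Assumed naming scheme: 'C' followed by exactly four decimal digits.
    private static final Pattern CODE_NAME = Pattern.compile("C(\\d{4})");

    /** Returns the character for names like "C0065", or empty if the name does not match. */
    static Optional<String> toUnicode(String glyphName) {
        Matcher m = CODE_NAME.matcher(glyphName);
        if (!m.matches()) {
            return Optional.empty(); // leave standard names such as "period" alone
        }
        int codePoint = Integer.parseInt(m.group(1)); // decimal, so "0065" is 65
        return Optional.of(new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        for (String name : new String[] {"C0046", "C0065", "period"}) {
            System.out.println(name + " -> " + toUnicode(name).orElse("(no mapping)"));
        }
    }
}
```

A name that does not match the assumed pattern yields an empty result, so
a caller could fall back to the normal glyph list lookup for it.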


Tilman
