Posted to dev@pdfbox.apache.org by Ian Smith <Ia...@gossinteractive.com> on 2010/03/09 12:05:12 UTC

Odd characters from text extraction

Hi Folks,

The PDF linked below (~2 MB) produces unprintable characters in the
extracted text output.  These characters seem to be confined to the
first two pages of the document.

http://www.yourphp.org.uk/media/pdf/g/4/Annual_Report_0809.pdf

I believe the problem is caused by at least one of the fonts embedded in
the document. My debugging shows that the strange characters are
associated with Identity-H encoding and/or Type 1 (CID) fonts and,
possibly, with the Mistral font (KWTOGC+Mistral?).  Fonts whose text
extracts correctly seem to use the WinAnsi encoding.
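
For anyone trying to narrow this down, here is a minimal sketch of one
way to locate the unprintable characters in text that has already been
extracted. It uses only the Java standard library and is not part of
PDFBox; the class name SuspectChars, its methods, and the sample string
are hypothetical. The assumption is that control, private-use, and
unassigned code points (plus U+FFFD) point to a failed glyph-to-Unicode
mapping.

```java
import java.util.ArrayList;
import java.util.List;

public class SuspectChars {

    /** True for code points that usually indicate a failed glyph-to-Unicode mapping. */
    static boolean isSuspect(int cp) {
        if (cp == '\n' || cp == '\r' || cp == '\t') {
            return false; // ordinary whitespace is expected in extracted text
        }
        int type = Character.getType(cp);
        return type == Character.CONTROL
                || type == Character.PRIVATE_USE
                || type == Character.UNASSIGNED
                || cp == 0xFFFD; // Unicode replacement character
    }

    /** Returns the char offsets in the text at which a suspect code point starts. */
    static List<Integer> suspectOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isSuspect(cp)) {
                offsets.add(i);
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Hypothetical sample: one control character and one private-use character.
        String sample = "Annual\u0001 Report\uE000 2008\n";
        System.out.println(suspectOffsets(sample)); // prints [6, 14]
    }
}
```

Running this over the text of each page separately should show whether
the bad characters really are limited to the first two pages, and
comparing the offsets against the fonts used at those positions may
help confirm which font is responsible.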

I have not been able to debug further owing to the large number of
deeply nested PDF objects (I don't really know anything about PDF!).
I hope this is the right place to report this; if not, please let me
know.

Regards,

Ian Smith.


