Posted to users@pdfbox.apache.org by Thomas Fischer <fi...@aon.at> on 2010/05/04 16:57:00 UTC

Unsatisfactory decoding in some pdf documents

Hello,

I have extracted text from a series of mathematical documents (articles, dissertations, books). The result seems to be OK for the majority, but some texts come out unintelligible. In those documents every character is replaced by some kind of code, and I could distinguish at least three variants. All characters look like the following examples:

1. x57x65x69x65x72x73x74x72x61xffx2dx49x6ex73x74x69x74x75x74
2. a0a2a1a4a3a6a5a8a7a10a9a12a11a14a13a15a9a16a13a15a11
3. BYCXD2CPD2CRCXCPD0 BWCTD6CXDACPD8CXDACTD7

Using Apple's PDFKit, I obtain readable results for the first case:
Weierstraÿ-Institut für Angewandte Analysis und Stochastik
Preprint
ISSN 0946 􏴏 8633
…
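
As far as I can tell, the first variant is just the byte values of the characters written as two-digit hex codes, each prefixed with "x". A minimal Python sketch (not PDFBox code; the function name decode_hex_escapes is made up for illustration, and reading the bytes as Latin-1 is my assumption) turns the sample back into the same string that PDFKit shows, including the "ÿ" where "ß" would be expected. That would suggest byte 0xff is the font's own slot for "ß", but this is only a guess.

```python
import re


def decode_hex_escapes(raw: str, encoding: str = "latin-1") -> str:
    """Read each 'xNN' group as one byte and decode the byte string.

    Hypothetical helper for illustration only; Latin-1 is an assumed
    encoding, chosen because it maps every byte value to a character.
    """
    codes = re.findall(r"x([0-9a-fA-F]{2})", raw)
    return bytes(int(code, 16) for code in codes).decode(encoding)


# First sample from the extracted text above.
sample = "x57x65x69x65x72x73x74x72x61xffx2dx49x6ex73x74x69x74x75x74"
decoded = decode_hex_escapes(sample)
print(decoded)
```

The second and third variants do not follow this pattern, so the sketch does not help with them.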

Can anybody tell me what these codes mean, and whether there is a way to improve the results?

Best regards
Thomas Fischer