Posted to dev@pdfbox.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2014/07/28 15:14:39 UTC
[jira] [Created] (PDFBOX-2247) Regression in text extraction between 1.8.5 and 1.8.6
Tim Allison created PDFBOX-2247:
-----------------------------------
Summary: Regression in text extraction between 1.8.5 and 1.8.6
Key: PDFBOX-2247
URL: https://issues.apache.org/jira/browse/PDFBOX-2247
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.6
Reporter: Tim Allison
Priority: Minor
It looks like a character mapping issue crept in at some point between 1.8.5 and 1.8.6; it shows up on this [file|http://digitalcorpora.org/corp/nps/files/govdocs1/701/701542.pdf].
In 1.8.5, ExtractText extracted the correct text with both the sequential and the NonSeq parsers. In 1.8.6, java -jar pdfbox-app-1.8.6.jar ExtractText yields text starting with:
{noformat}7>PFLK>I 9>NH ;BNRF@B
=%;% .BM>NPJBKP LC PEB 3KPBNFLN
9>@FCF@ -L>OP ;@FBK@B >KA 5B>NKFKD -BKPBN
:BOB>N@E 9NLGB@P ;QJJ>NT .B@BJ?BN (&&*
"&++&,-+Æ$( #&+-&%+$-& !).&)-*+Æ&,{noformat}
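One way to pin down where the two versions diverge is to save the ExtractText output from each release to a file and compare the files line by line. The following is only a minimal sketch under stated assumptions: the class name ExtractionDiff and the method firstDifferingLine are hypothetical helpers and not part of PDFBox, and both output files are assumed to be UTF-8 encoded.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox: compares two saved extractions.
public class ExtractionDiff {

    /**
     * Returns the zero-based index of the first line at which the two
     * extractions differ, or -1 if they match line for line.
     */
    public static int firstDifferingLine(String oldText, String newText) {
        // \R matches any line terminator (Java 8+); -1 keeps trailing empty lines.
        String[] a = oldText.split("\\R", -1);
        String[] b = newText.split("\\R", -1);
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            if (!a[i].equals(b[i])) {
                return i;
            }
        }
        // Same prefix: either identical, or one output has extra lines.
        return a.length == b.length ? -1 : common;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: ExtractionDiff <old-output.txt> <new-output.txt>");
            return;
        }
        // Assumption: both files were written as UTF-8.
        String oldText = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        String newText = new String(Files.readAllBytes(Paths.get(args[1])), StandardCharsets.UTF_8);
        int line = firstDifferingLine(oldText, newText);
        System.out.println(line < 0
                ? "Outputs are identical"
                : "First difference at line " + (line + 1));
    }
}
```

For the file in this report, the first argument would be the text saved from 1.8.5 and the second the text saved from 1.8.6. Since the 1.8.6 output is already garbled in its first line, the comparison would be expected to report a difference at line 1.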