Posted to dev@pdfbox.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2014/07/28 15:14:39 UTC

[jira] [Created] (PDFBOX-2247) Regression in text extraction between 1.8.5 and 1.8.6

Tim Allison created PDFBOX-2247:
-----------------------------------

             Summary: Regression in text extraction between 1.8.5 and 1.8.6
                 Key: PDFBOX-2247
                 URL: https://issues.apache.org/jira/browse/PDFBOX-2247
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.8.6
            Reporter: Tim Allison
            Priority: Minor


It looks like a character mapping issue crept in sometime between 1.8.5 and 1.8.6, at least for this [file|http://digitalcorpora.org/corp/nps/files/govdocs1/701/701542.pdf].

In 1.8.5, ExtractText extracted the correct text with both the seq and NonSeq parsers.  In 1.8.6, running java -jar pdfbox-app-1.8.6.jar ExtractText yields text starting with: {noformat}7>PFLK>I 9>NH ;BNRF@B
=%;% .BM>NPJBKP LC PEB 3KPBNFLN
9>@FCF@ -L>OP ;@FBK@B >KA 5B>NKFKD -BKPBN
:BOB>N@E 9NLGB@P ;QJJ>NT .B@BJ?BN (&&*
"&++&,-+Æ$( #&+-&%+$-& !).&)-*+Æ&,{noformat}

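One way to pin down where the two versions diverge is to compare their outputs line by line. The sketch below is only an illustration and assumes the text from each ExtractText run has already been saved and read into a string; the class ExtractionDiff and its method firstDifferingLine are hypothetical names and are not part of PDFBox.

```java
// Hypothetical helper, not part of PDFBox: compares two already-extracted texts.
final class ExtractionDiff {

    /**
     * Returns the 1-based number of the first line that differs between
     * the baseline text (e.g. from 1.8.5) and the candidate text
     * (e.g. from 1.8.6), or -1 if all lines are identical.
     */
    static int firstDifferingLine(String baseline, String candidate) {
        // Split on LF or CRLF; the -1 limit keeps trailing empty lines.
        String[] a = baseline.split("\\r?\\n", -1);
        String[] b = candidate.split("\\r?\\n", -1);
        int shared = Math.min(a.length, b.length);
        for (int i = 0; i < shared; i++) {
            if (!a[i].equals(b[i])) {
                return i + 1;
            }
        }
        // If one text is a strict prefix of the other, the first extra line differs.
        return a.length == b.length ? -1 : shared + 1;
    }
}
```

For the file above, a result of 1 would be expected if the 1.8.6 output is already garbled on its first line, as the excerpt suggests.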

