Posted to dev@pdfbox.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2013/12/03 17:40:36 UTC
[jira] [Updated] (PDFBOX-1793) Failure to extract custom encoded text
[ https://issues.apache.org/jira/browse/PDFBOX-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated PDFBOX-1793:
--------------------------------
Attachment: gaat fout.txt, gaat fout.pdf
Source PDF and a .txt version of the output from pdftotext. The source file was provided as part of TIKA-1199.
> Failure to extract custom encoded text
> --------------------------------------
>
> Key: PDFBOX-1793
> URL: https://issues.apache.org/jira/browse/PDFBOX-1793
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.3
> Reporter: Tim Allison
> Priority: Minor
> Attachments: gaat fout.pdf, gaat fout.txt
>
>
> PDFBox extracts binary garbage from this file, and Adobe Reader does the same. pdftotext on Linux extracts the text fairly well. I suspect there's a custom font/encoding node that isn't being processed, but I could be wrong.
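For anyone triaging similar files in bulk, here is a rough sketch of how output like this might be flagged automatically. It is not part of PDFBox or Tika. The class name GarbleCheck, both method names, and the 0.1 threshold are hypothetical choices for illustration. It assumes the garbled output shows up as control characters, U+FFFD, private-use or unassigned code points. That is only a guess about this file, and the check would miss text that is mapped to the wrong but still printable characters.

```java
public final class GarbleCheck {
    private GarbleCheck() {}

    /**
     * Returns the fraction of code points in the text that look like
     * extraction debris. Hypothetical heuristic, not a PDFBox API.
     */
    static double suspiciousRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (isSuspicious(cp)) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    /** Ordinary whitespace controls (tab, LF, CR) are not counted. */
    private static boolean isSuspicious(int cp) {
        if (cp == '\t' || cp == '\n' || cp == '\r') {
            return false;
        }
        int type = Character.getType(cp);
        return cp == 0xFFFD                      // Unicode replacement character
                || type == Character.CONTROL
                || type == Character.PRIVATE_USE
                || type == Character.UNASSIGNED
                || type == Character.SURROGATE;  // unpaired surrogate
    }

    /** The threshold is an assumed tuning knob, e.g. 0.1 for 10 percent. */
    static boolean looksGarbled(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }

    public static void main(String[] args) {
        System.out.println(looksGarbled("Plain readable text.", 0.1));
        System.out.println(looksGarbled("\u0001\u0002\u0003\uFFFD ok", 0.1));
    }
}
```

The string passed in would be whatever the extractor returned for the document. Comparing the ratio for the PDFBox output against the ratio for the attached pdftotext output is one way to confirm that the two tools really disagree on this file.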
--
This message was sent by Atlassian JIRA
(v6.1#6144)