Posted to dev@pdfbox.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2013/12/03 17:40:36 UTC

[jira] [Updated] (PDFBOX-1793) Failure to extract custom encoded text

     [ https://issues.apache.org/jira/browse/PDFBOX-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated PDFBOX-1793:
--------------------------------

    Attachment: gaat fout.txt
                gaat fout.pdf

Source PDF and the .txt output that pdftotext produces from it.  The source file was provided as part of TIKA-1199.

> Failure to extract custom encoded text
> --------------------------------------
>
>                 Key: PDFBOX-1793
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1793
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.3
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: gaat fout.pdf, gaat fout.txt
>
>
> PDFBox extracts binary garbage from this file, and Adobe Reader does the same.  pdftotext on Linux extracts the text fairly well.  I suspect there's a custom font/encoding node that isn't being processed, but I could be wrong.
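
One rough way to compare the extractors' outputs on this file is to measure how much of the extracted text consists of characters that should not normally appear in prose. The sketch below is only an illustration and is not part of PDFBox, Tika, or pdftotext: the class name GarbleCheck, its method names, and the 0.3 threshold are all hypothetical. It counts control characters, private-use and unassigned code points, lone surrogates, and U+FFFD. It will not catch text that has been mapped to the wrong but printable characters, which a custom encoding could also produce.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper: estimates how "garbled" a piece of extracted text is. */
public final class GarbleCheck {
    private GarbleCheck() {}

    /** Returns the fraction of code points in the text that look suspicious (0.0 for empty input). */
    public static double garbleRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (isSuspicious(cp)) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    /** Control, private-use, unassigned, lone-surrogate, or U+FFFD; ordinary whitespace is allowed. */
    static boolean isSuspicious(int cp) {
        if (cp == '\n' || cp == '\r' || cp == '\t') {
            return false;
        }
        int type = Character.getType(cp);
        return type == Character.CONTROL
                || type == Character.PRIVATE_USE
                || type == Character.UNASSIGNED
                || type == Character.SURROGATE
                || cp == 0xFFFD;
    }

    /** True when the suspicious fraction exceeds the given threshold (an assumed cutoff, not a standard). */
    public static boolean looksGarbled(String text, double threshold) {
        return garbleRatio(text) > threshold;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a text file holding one extractor's output, read as UTF-8.
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        System.out.printf("garble ratio: %.3f%n", garbleRatio(text));
        System.out.println("looks garbled: " + looksGarbled(text, 0.3));
    }
}
```

Running this on the PDFBox output and on the attached pdftotext output would give two ratios to compare. A large gap between them would be consistent with the description above, though it would not by itself show where the encoding is being lost.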



--
This message was sent by Atlassian JIRA
(v6.1#6144)