Posted to dev@pdfbox.apache.org by "Dan Dorazio (JIRA)" <ji...@apache.org> on 2017/01/13 21:37:26 UTC

[jira] [Commented] (PDFBOX-3438) only garbage extracted, lots of warnings "No Unicode mapping..."

    [ https://issues.apache.org/jira/browse/PDFBOX-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822384#comment-15822384 ] 

Dan Dorazio commented on PDFBOX-3438:
-------------------------------------

Hi all - 

I read the most recent response, from 7.27.16, about a bug in Distiller. However, I have a document created in '06 that shows the same symptom: text extraction runs, but the output is only garbage. Do you know whether the Distiller bug referenced above could have been present at that time as well?

We are performing the extraction with the latest version of Apache Tika (1.14), which includes (and uses) PDFBox 2.0.3. Unfortunately, I cannot share the document, as it contains sensitive information. I'd be interested in the attached patch, but I'm not sure how I'd apply it, given our use of Tika. I suppose I could try it outside of Tika and see whether the result improves. Any other ideas for a workaround?

Thanks,
Dan
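
For the comparison mentioned above (running the extraction outside of Tika and checking whether the result improves), one rough way to put a number on "improves" is to score each extracted text file by the share of characters that look readable. The following is only a minimal sketch under stated assumptions: the class name ExtractionScore, the method readableRatio, and the heuristic itself are hypothetical and are not part of PDFBox or Tika. It also assumes both outputs have already been saved as UTF-8 text files. Garbage that happens to consist of ordinary letters will still score high, so this is a coarse signal at best.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ExtractionScore {

    // Hypothetical heuristic (an assumption, not a PDFBox or Tika feature):
    // the fraction of code points that are letters, digits, whitespace
    // (as defined by Character.isWhitespace), or printable ASCII punctuation.
    // Returns 0.0 for empty input.
    static double readableRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        long total = text.codePoints().count();
        long readable = text.codePoints().filter(ExtractionScore::isReadable).count();
        return (double) readable / total;
    }

    static boolean isReadable(int cp) {
        if (Character.isLetterOrDigit(cp) || Character.isWhitespace(cp)) {
            return true;
        }
        // Printable ASCII range, which covers common punctuation.
        return cp >= 0x21 && cp <= 0x7E;
    }

    // Prints one score per file given on the command line.
    public static void main(String[] args) throws Exception {
        for (String path : args) {
            String text = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
            System.out.printf("%s: %.3f%n", path, readableRatio(text));
        }
    }
}
```

Running it with two file names (for example, one output saved from Tika and one from a standalone run; both names are up to the user) prints one score per file, and a clearly higher score for one of them suggests fewer control or replacement characters in that output.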

> only garbage extracted, lots of warnings "No Unicode mapping..."
> ----------------------------------------------------------------
>
>                 Key: PDFBOX-3438
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3438
>             Project: PDFBox
>          Issue Type: Wish
>          Components: Text extraction
>    Affects Versions: 2.0.2
>            Reporter: Oliver Steinau
>         Attachments: PDFBOX-3438.diff, PDFBOX-3438.txt, test.pdf
>
>
> When I try to extract text from this PDF, I get lots of warnings "No Unicode mapping for ...", and as output I only get garbage.
> PDF file displays fine in Acrobat Reader, and pdftotext.exe will extract the text just fine.
> PDF file seems to have a Type-1 font embedded with a custom encoding.


