Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2010/05/17 09:04:44 UTC

[jira] Commented: (PDFBOX-729) Text extracted from a TeX-created PDF file is unintelligible, but not of the form a1a2a3...

    [ https://issues.apache.org/jira/browse/PDFBOX-729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868106#action_12868106 ] 

Andreas Lehmkühler commented on PDFBOX-729:
-------------------------------------------

Some of the text uses a Type3 font, and that text can't be extracted because of the glyphs the font uses.

> Text extracted from a TeX-created PDF file is unintelligible, but not of the form a1a2a3...
> -------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-729
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-729
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.1.0
>         Environment: Mac OS X 10.6.3, using org.apache.pdfbox.ExtractText -encoding UTF-8
>            Reporter: Thomas Fischer
>         Attachments: wias_preprints_1427.pdf, wias_preprints_1427.txt
>
>
> Text extracted from some PDF files is completely unintelligible, presumably depending on the software used to create the file. In this example, a combination of dvips(k) 5.95a Copyright 2005 Radical Eye Software (to create PostScript) and Acrobat Distiller 8.1.0 (Windows) (to create the PDF file) was used. The extracted text looks like:
> CFCTCXCTD6D7D8D6CPH3B9C1D2D7D8CXD8D9D8
> CUH0D6 BTD2CVCTDBCPD2CSD8CT BTD2CPD0DDD7CXD7 D9D2CS CBD8D3CRCWCPD7D8CXCZ
> CXD1 BYD3D6D7CRCWD9D2CVD7DACTD6CQD9D2CS BUCTD6D0CXD2 CTBACEBA
> C
> Only rarely are bits and pieces of recognisable formulas interspersed.
> The text copied using either Acrobat Reader or Preview looks different, but is similarly unintelligible.
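
The sample output quoted above appears to follow a regular pattern: each character is written as a two-character pair, a letter followed by a base-36 digit (0-9, then A-Z), so that code = 36 * (first - 'A') + digit. This is only an observation from the three sample lines in this report; the issue does not confirm it, and it is not part of PDFBox. The following is a minimal sketch under that assumption. The class and method names are hypothetical, and codes above 127 are mapped as Latin-1 code points, which may be wrong for the font's actual encoding.

```java
// Hypothetical helper, not a PDFBox class. It assumes the pair scheme
// described above, inferred only from the sample in this report.
final class GlyphPairDecoder {

    // Assumed digit alphabet for the second character of each pair.
    private static final String DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    /** Decodes one token, or returns null if it does not fit the assumed scheme. */
    static String decodeToken(String token) {
        if (token.isEmpty() || token.length() % 2 != 0) {
            return null;
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < token.length(); i += 2) {
            int high = token.charAt(i) - 'A';
            int low = DIGITS.indexOf(token.charAt(i + 1));
            if (high < 0 || high > 7 || low < 0) {
                return null;
            }
            int code = 36 * high + low;
            if (code > 255) {
                return null;
            }
            // Assumption: treat the code as a Latin-1 code point.
            out.append((char) code);
        }
        return out.toString();
    }

    /** Decodes a line word by word; words that do not fit are kept unchanged. */
    static String decode(String line) {
        String[] tokens = line.split(" ", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            String decoded = decodeToken(tokens[i]);
            out.append(decoded != null ? decoded : tokens[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Third sample line from the report.
        System.out.println(decode("CXD1 BYD3D6D7CRCWD9D2CVD7DACTD6CQD9D2CS BUCTD6D0CXD2 CTBACEBA"));
    }
}
```

Words that do not fit the scheme, such as the lone "C" in the sample, are passed through unchanged, so the sketch does not silently invent text where the assumption fails.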
