Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2010/05/17 09:04:44 UTC
[jira] Commented: (PDFBOX-729) Text extracted from a TeX-created
PDF file is unintelligible, but not of the form a1a2a3...
[ https://issues.apache.org/jira/browse/PDFBOX-729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868106#action_12868106 ]
Andreas Lehmkühler commented on PDFBOX-729:
-------------------------------------------
Some of the text uses a Type 3 font, which can't be extracted because of the glyphs it uses.
> Text extracted from a TeX-created PDF file is unintelligible, but not of the form a1a2a3...
> -------------------------------------------------------------------------------------------
>
> Key: PDFBOX-729
> URL: https://issues.apache.org/jira/browse/PDFBOX-729
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.1.0
> Environment: Mac OS X 10.6.3, using org.apache.pdfbox.ExtractText -encoding UTF-8
> Reporter: Thomas Fischer
> Attachments: wias_preprints_1427.pdf, wias_preprints_1427.txt
>
>
> Text extracted from some PDF files is completely unintelligible, presumably depending on the software used to create the file. In this example, a combination of dvips(k) 5.95a Copyright 2005 Radical Eye Software (to create PostScript) and Acrobat Distiller 8.1.0 (Windows) (to create the PDF file) was used. The extracted text looks like this:
> CFCTCXCTD6D7D8D6CPH3B9C1D2D7D8CXD8D9D8
> CUH0D6 BTD2CVCTDBCPD2CSD8CT BTD2CPD0DDD7CXD7 D9D2CS CBD8D3CRCWCPD7D8CXCZ
> CXD1 BYD3D6D7CRCWD9D2CVD7DACTD6CQD9D2CS BUCTD6D0CXD2 CTBACEBA
> C
> Only rarely are bits and pieces of recognisable formulas interspersed.
> The text copied using either Acrobat Reader or Preview looks different, but is similarly unintelligible.
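
The sample quoted above does seem to follow a pattern, although nothing in this thread confirms it, so treat what follows as an observation rather than a description of how PDFBox or dvips works. Each word splits into two-character tokens; reading the first character as a row (A = 0, B = 1, ...) and the second as a base-36 digit (0-9, then A-Z) gives a number that matches a character code. The sketch below decodes the sample under that assumption. The helper names `decode_token` and `decode_line` are hypothetical, and mapping code 255 to "ß" assumes the font uses a TeX T1 (Cork) style encoding.

```python
import string

# Assumed alphabet for the second character of each token: 0-9, then A-Z.
DIGITS = string.digits + string.ascii_uppercase

# Assumed override: in the TeX T1 (Cork) encoding, slot 255 holds "ß".
T1_OVERRIDES = {255: "ß"}


def decode_token(token):
    """Turn a two-character token such as 'CT' into a single character."""
    row = ord(token[0]) - ord("A")   # 'A' -> 0, 'B' -> 1, ...
    col = DIGITS.index(token[1])     # '0' -> 0, ..., 'Z' -> 35
    code = row * 36 + col
    return T1_OVERRIDES.get(code, chr(code))


def decode_line(line):
    """Decode a line of whitespace-separated words made of such tokens."""
    words = []
    for word in line.split():
        if len(word) % 2:
            # Odd-length fragments (such as the lone 'C') are left untouched.
            words.append(word)
            continue
        pairs = (word[i:i + 2] for i in range(0, len(word), 2))
        words.append("".join(decode_token(pair) for pair in pairs))
    return " ".join(words)


sample = [
    "CFCTCXCTD6D7D8D6CPH3B9C1D2D7D8CXD8D9D8",
    "CUH0D6 BTD2CVCTDBCPD2CSD8CT BTD2CPD0DDD7CXD7 D9D2CS CBD8D3CRCWCPD7D8CXCZ",
    "CXD1 BYD3D6D7CRCWD9D2CVD7DACTD6CQD9D2CS BUCTD6D0CXD2 CTBACEBA",
]

for line in sample:
    print(decode_line(line))
# Prints:
# Weierstraß-Institut
# für Angewandte Analysis und Stochastik
# im Forschungsverbund Berlin e.V.
```

If the pattern holds for the rest of the file, the extracted names are most likely the glyph names of the Type 3 font, and a post-processing step of this kind could recover the plain-text portions. Formulas set in other fonts would still need separate handling.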