Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2010/05/17 09:14:43 UTC
[jira] Updated: (PDFBOX-729) Disable text extraction when using
type3 fonts (was: Text extracted from a TeX-created PDF file is
unintelligible, but not of the form a1a2a3...)
[ https://issues.apache.org/jira/browse/PDFBOX-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler updated PDFBOX-729:
--------------------------------------
Summary: Disable text extraction when using type3 fonts (was: Text extracted from a TeX-created PDF file is unintelligible, but not of the form a1a2a3...) (was: Disable text extraction hwne using type3 fonts (was: Text extracted from a TeX-created PDF file is unintelligible, but not of the form a1a2a3...))
> Disable text extraction when using type3 fonts (was: Text extracted from a TeX-created PDF file is unintelligible, but not of the form a1a2a3...)
> -------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-729
> URL: https://issues.apache.org/jira/browse/PDFBOX-729
> Project: PDFBox
> Issue Type: Improvement
> Components: Text extraction
> Affects Versions: 1.1.0
> Environment: Mac OS X 10.6.3, using org.apache.pdfbox.ExtractText -encoding UTF-8
> Reporter: Thomas Fischer
> Priority: Minor
> Attachments: wias_preprints_1427.pdf, wias_preprints_1427.txt
>
>
> Text extracted from some PDF files is completely unintelligible, presumably depending on the software used to create the file. In this example, a combination of dvips(k) 5.95a Copyright 2005 Radical Eye Software (to create PostScript) and Acrobat Distiller 8.1.0 (Windows) (to create the PDF file) was used. The text extracted looks like
> CFCTCXCTD6D7D8D6CPH3B9C1D2D7D8CXD8D9D8
> CUH0D6 BTD2CVCTDBCPD2CSD8CT BTD2CPD0DDD7CXD7 D9D2CS CBD8D3CRCWCPD7D8CXCZ
> CXD1 BYD3D6D7CRCWD9D2CVD7DACTD6CQD9D2CS BUCTD6D0CXD2 CTBACEBA
> C
> Only rarely are bits and pieces of recognisable formulas interspersed.
> The text copied using either Acrobat Reader or Preview looks different, but is similarly unintelligible.
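
The sample quoted above appears to follow a regular pattern: each pair of characters looks like a glyph name whose first letter (A = 0 ... Z = 25) is multiplied by 36 and added to a base-36 digit (0-9, A-Z) to give a character code. This is only an observation about the sample in this report, not documented behaviour of dvips, Distiller, or PDFBox. The following minimal sketch, with the hypothetical class name GlyphNameDecoder, decodes tokens under that assumption. Codes outside printable ASCII are emitted as bracketed numbers, since the underlying font encoding is not known from the report.

```java
// Hypothetical helper, not part of PDFBox. It assumes each two-character
// pair encodes (firstLetter - 'A') * 36 + base36(secondChar).
final class GlyphNameDecoder {

    // Value of a base-36 digit restricted to 0-9 and A-Z, or -1 otherwise.
    static int base36(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'Z') return c - 'A' + 10;
        return -1;
    }

    // Decodes one token; returns it unchanged if it does not fit the pattern.
    static String decodeToken(String token) {
        if (token.isEmpty() || token.length() % 2 != 0) return token;
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < token.length(); i += 2) {
            char hi = token.charAt(i);
            int lo = base36(token.charAt(i + 1));
            if (hi < 'A' || hi > 'Z' || lo < 0) return token;
            int code = (hi - 'A') * 36 + lo;
            if (code >= 32 && code < 127) {
                out.append((char) code);                 // printable ASCII
            } else {
                out.append('[').append(code).append(']'); // encoding unknown
            }
        }
        return out.toString();
    }

    // Decodes a line of space-separated tokens.
    static String decode(String line) {
        String[] tokens = line.split(" ");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) out.append(' ');
            out.append(decodeToken(tokens[i]));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Third line of the sample from the report.
        System.out.println(decode(
            "CXD1 BYD3D6D7CRCWD9D2CVD7DACTD6CQD9D2CS BUCTD6D0CXD2 CTBACEBA"));
        // prints "im Forschungsverbund Berlin e.V."
    }
}
```

Tokens that do not fit the assumed pattern, such as the lone "C" in the sample, are passed through unchanged, so the sketch can be run over mixed output without losing text.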