Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/10/11 03:21:33 UTC

[jira] [Closed] (PDFBOX-729) Disable text extraction when using type3 fonts (was: Text extracted from a TeX-created PDF file is unintelligible, but not of the form a1a2a3...)

     [ https://issues.apache.org/jira/browse/PDFBOX-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson closed PDFBOX-729.
------------------------------
    Resolution: Won't Fix

On the other hand, some Type 3 fonts do contain useful text, which is why Acrobat attempts to extract text from them. This is a good call, and PDFBox should continue to do the same thing.

> Disable text extraction when using type3 fonts (was: Text extracted from a TeX-created PDF file is unintelligible, but not of the form a1a2a3...)
> -------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-729
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-729
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 1.1.0
>         Environment: Mac OS X 10.6.3, using org.apache.pdfbox.ExtractText -encoding UTF-8
>            Reporter: Thomas Fischer
>            Priority: Minor
>         Attachments: wias_preprints_1427.pdf, wias_preprints_1427.txt
>
>
> Text extracted from some PDF files is completely unintelligible, presumably depending on the software used to create the file. In this example, a combination of dvips(k) 5.95a Copyright 2005 Radical Eye Software (to create PostScript) and Acrobat Distiller 8.1.0 (Windows) (to create the PDF file) was used. The text extracted looks like
> CFCTCXCTD6D7D8D6CPH3B9C1D2D7D8CXD8D9D8
> CUH0D6 BTD2CVCTDBCPD2CSD8CT BTD2CPD0DDD7CXD7 D9D2CS CBD8D3CRCWCPD7D8CXCZ
> CXD1 BYD3D6D7CRCWD9D2CVD7DACTD6CQD9D2CS BUCTD6D0CXD2 CTBACEBA
> C
> Only rarely are bits and pieces of recognisable formulas interspersed.
> The text copied using either Acrobat Reader or Preview looks different, but is similarly unintelligible.
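
For readers who want to flag output like the sample above after extraction, here is a minimal sketch of a heuristic check. It is not part of PDFBox: the class name GarbledTextCheck, the method looksGarbled, and all thresholds are assumptions made for illustration. It relies only on what the sample shows, namely long runs of uppercase letters and digits with almost no vowels.

```java
public final class GarbledTextCheck {
    // Hypothetical thresholds, chosen by eye from the sample in this report.
    private static final int MIN_CHARS = 20;
    private static final double MIN_UPPER_DIGIT_SHARE = 0.90;
    private static final double MAX_VOWEL_SHARE = 0.15;

    private GarbledTextCheck() {}

    /** Returns true if the text resembles the glyph-code output quoted above. */
    public static boolean looksGarbled(String text) {
        int visible = 0;       // non-whitespace characters
        int upperOrDigit = 0;  // characters in A-Z or 0-9
        int letters = 0;       // all letters, any case
        int vowels = 0;        // ASCII vowels, any case
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                continue;
            }
            visible++;
            if ((c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')) {
                upperOrDigit++;
            }
            if (Character.isLetter(c)) {
                letters++;
                if ("AEIOUaeiou".indexOf(c) >= 0) {
                    vowels++;
                }
            }
        }
        // Too little evidence to judge short or letter-free strings.
        if (visible < MIN_CHARS || letters == 0) {
            return false;
        }
        double upperDigitShare = (double) upperOrDigit / visible;
        double vowelShare = (double) vowels / letters;
        return upperDigitShare >= MIN_UPPER_DIGIT_SHARE
                && vowelShare <= MAX_VOWEL_SHARE;
    }

    public static void main(String[] args) {
        // First line of the extracted sample from the report.
        System.out.println(looksGarbled("CFCTCXCTD6D7D8D6CPH3B9C1D2D7D8CXD8D9D8"));
        // Ordinary prose for comparison.
        System.out.println(looksGarbled("Text extracted from some PDF files is fine."));
    }
}
```

The vowel test is there so that ordinary all-caps headings are not flagged. The check only detects this one pattern and says nothing about why the text is unintelligible or whether a Type 3 font is involved.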


