Posted to dev@pdfbox.apache.org by "Thomas Fischer (JIRA)" <ji...@apache.org> on 2010/05/16 14:28:45 UTC

[jira] Created: (PDFBOX-729) Text extracted from a TeX-created PDF file is unintelligible, but not of the form a1a2a3...

Text extracted from a TeX-created PDF file is unintelligible, but not of the form a1a2a3...
-------------------------------------------------------------------------------------------

                 Key: PDFBOX-729
                 URL: https://issues.apache.org/jira/browse/PDFBOX-729
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.1.0
         Environment: Mac OS X 10.6.3, using org.apache.pdfbox.ExtractText -encoding UTF-8
            Reporter: Thomas Fischer


Text extracted from some PDF files is completely unintelligible, presumably depending on the software used to create the file. In this example, the file was created with a combination of dvips(k) 5.95a Copyright 2005 Radical Eye Software (to create PostScript) and Acrobat Distiller 8.1.0 (Windows) (to create the PDF file). The extracted text looks like this:

CFCTCXCTD6D7D8D6CPH3B9C1D2D7D8CXD8D9D8
CUH0D6 BTD2CVCTDBCPD2CSD8CT BTD2CPD0DDD7CXD7 D9D2CS CBD8D3CRCWCPD7D8CXCZ
CXD1 BYD3D6D7CRCWD9D2CVD7DACTD6CQD9D2CS BUCTD6D0CXD2 CTBACEBA
C

Only occasionally are bits and pieces of recognisable formulas interspersed.

The text copied using either Acrobat Reader or Preview looks different, but is similarly unintelligible.
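
A side observation on the sample above, not something stated in this thread: the output appears to consist of two-character glyph names, where each pair encodes a character code as (first letter - 'A') * 36 + base-36 value of the second character. Under that assumption, the first sample line decodes to "Weierstraß-Institut". The sketch below is a hypothetical stand-alone Java helper, not part of PDFBox; the class name GlyphPairDecoder and the mapping of codes 252 and 255 to ü and ß (which would match the TeX T1/Cork font encoding) are assumptions made for illustration.

```java
import java.util.Map;

// Hypothetical helper: decodes the two-character glyph names seen in the sample.
final class GlyphPairDecoder {
    // Assumption: codes above 127 follow the TeX T1 (Cork) layout.
    // Only the two high codes that occur in the sample are listed.
    private static final Map<Integer, Character> HIGH_CODES =
            Map.of(252, '\u00fc', 255, '\u00df'); // u-umlaut, sharp s

    // Turns one pair such as "CP" into a character code: ('C' - 'A') * 36 + 25 = 97.
    static int code(char first, char second) {
        int hi = first - 'A';
        int lo = Character.digit(second, 36);
        if (hi < 0 || hi > 25 || lo < 0) {
            throw new IllegalArgumentException("not a glyph pair: " + first + second);
        }
        return hi * 36 + lo;
    }

    // Decodes a line of pairs; spaces are kept, a dangling single character becomes '?'.
    static String decode(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == ' ') {
                out.append(' ');
                i++;
            } else if (i + 1 < text.length() && text.charAt(i + 1) != ' ') {
                int value = code(c, text.charAt(i + 1));
                out.append(value < 128 ? (char) value : HIGH_CODES.getOrDefault(value, '?'));
                i += 2;
            } else {
                out.append('?'); // incomplete pair, e.g. the lone "C" in the sample
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("BTD2CPD0DDD7CXD7 D9D2CS CBD8D3CRCWCPD7D8CXCZ"));
        // prints "Analysis und Stochastik"
    }
}
```

If the pattern holds for other files, it is a property of how the producing software named the glyphs, so a decoder like this is a diagnostic aid rather than a general fix.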

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PDFBOX-729) Disable text extraction when using type3 fonts (was: Text extracted from a TeX-created PDF file is unintelligible, but not of the form a1a2a3...)

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-729:
--------------------------------------

    Summary: Disable text extraction when using type3 fonts (was: Text extracted from a TeX-created PDF file is unintelligible, but not of the form a1a2a3...)  (was: Disable text extraction hwne using type3 fonts (was: Text extracted from a TeX-created PDF file is unintelligible, but not of the form a1a2a3...))



[jira] Updated: (PDFBOX-729) Text extracted from a TeX-created PDF file is unintelligible, but not of the form a1a2a3...

Posted by "Thomas Fischer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Fischer updated PDFBOX-729:
----------------------------------

    Attachment: wias_preprints_1427.pdf
                wias_preprints_1427.txt

A PDF file that yields unintelligible text on extraction; the extracted text is included in the second file.



[jira] Updated: (PDFBOX-729) Disable text extraction hwne using type3 fonts (was: Text extracted from a TeX-created PDF file is unintelligible, but not of the form a1a2a3...)

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-729:
--------------------------------------

       Summary: Disable text extraction hwne using type3 fonts (was: Text extracted from a TeX-created PDF file is unintelligible, but not of the form a1a2a3...)  (was: Text extracted from a TeX-created PDF file is unintelligible, but not of the form a1a2a3...)
    Issue Type: Improvement  (was: Bug)
      Priority: Minor  (was: Major)

If Type3 fonts are used within a document, we should skip the extraction of those text parts to avoid scrambled output.
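
The proposal above can be sketched roughly as follows. This is a minimal stand-alone illustration, not PDFBox code: TextRun and extractSkippingType3 are hypothetical names, and the sketch assumes each run of text already carries the /Subtype value of the font that draws it (Type3 fonts are identified by the subtype name Type3 in the PDF font dictionary).

```java
import java.util.List;

// Hypothetical sketch of "skip text drawn with Type3 fonts"; not the PDFBox API.
final class Type3SkipSketch {
    /** Hypothetical stand-in for one run of text plus the /Subtype of its font. */
    static final class TextRun {
        final String fontSubtype;
        final String text;

        TextRun(String fontSubtype, String text) {
            this.fontSubtype = fontSubtype;
            this.text = text;
        }
    }

    /** Concatenates the text of all runs except those drawn with a Type3 font. */
    static String extractSkippingType3(List<TextRun> runs) {
        StringBuilder out = new StringBuilder();
        for (TextRun run : runs) {
            if ("Type3".equals(run.fontSubtype)) {
                continue; // would only contribute scrambled output
            }
            out.append(run.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<TextRun> runs = List.of(
                new TextRun("Type1", "plain "),
                new TextRun("Type3", "CFCTCXCT"),
                new TextRun("TrueType", "text"));
        System.out.println(extractSkippingType3(runs)); // prints "plain text"
    }
}
```

Dropping the runs silently keeps the output readable, at the cost of losing any indication that text was omitted; a real implementation might prefer to log or mark the skipped parts.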



[jira] Commented: (PDFBOX-729) Text extracted from a TeX-created PDF file is unintelligible, but not of the form a1a2a3...

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868106#action_12868106 ] 

Andreas Lehmkühler commented on PDFBOX-729:
-------------------------------------------

Some of the text uses a Type3 font, which can't be extracted because of the glyphs used.
