Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2010/05/16 19:12:42 UTC

[jira] Resolved: (PDFBOX-534) PDF file created with LaTeX is bad parsed

     [ https://issues.apache.org/jira/browse/PDFBOX-534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-534.
---------------------------------------

    Fix Version/s: 1.2.0
       Resolution: Fixed

I've found a solution for Ernesto's PDF and committed it in revision 944875.

There are 3 possible reasons for the described issues with PDFs created with TeX/LaTeX:

- using an unknown glyph name
- using a hexadecimal encoding (starting with an 'x', e.g. xF6) instead of a glyph name or a hexadecimal Unicode value
- using a decimal encoding (starting with an 'a', e.g. a128) instead of a glyph name or a hexadecimal Unicode value

I've created an additional glyph-mapping file to address the unknown glyph names and extended Encoding.getCharacter() to handle the hexadecimal and the decimal encodings.
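
To illustrate the idea behind the hexadecimal and decimal cases, here is a minimal sketch of such a fallback. It is not the code committed to PDFBox: the class name GlyphNameFallback and the method decode() are hypothetical, and the sketch assumes the fallback only runs after the regular glyph-name lookup has failed.

```java
import java.util.Optional;

// Hypothetical helper for illustration only; not part of PDFBox.
public class GlyphNameFallback {

    /**
     * Tries to read a name such as "xF6" (hexadecimal) or "a128" (decimal)
     * as a character code. Returns an empty Optional for anything else.
     * Assumption: the caller has already tried the known glyph names.
     */
    static Optional<String> decode(String name) {
        if (name == null || name.length() < 2) {
            return Optional.empty();
        }
        int radix;
        switch (name.charAt(0)) {
            case 'x': radix = 16; break;   // e.g. xF6
            case 'a': radix = 10; break;   // e.g. a128
            default:  return Optional.empty();
        }
        String digits = name.substring(1);
        // Reject names like "ae" or "xyz" whose tail is not purely numeric.
        for (int i = 0; i < digits.length(); i++) {
            if (Character.digit(digits.charAt(i), radix) < 0) {
                return Optional.empty();
            }
        }
        try {
            int code = Integer.parseInt(digits, radix);
            if (!Character.isValidCodePoint(code)) {
                return Optional.empty();
            }
            return Optional.of(new String(Character.toChars(code)));
        } catch (NumberFormatException e) {
            // Too many digits to fit in an int.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("a97"));     // decimal 97
        System.out.println(decode("xF6"));     // hexadecimal F6
        System.out.println(decode("hyphen"));  // not a numeric name, so empty
    }
}
```

With a fallback of this kind, a run such as a100a101 from the reported output would come out as the two characters with codes 100 and 101 ("de") instead of the raw names.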

Thomas's PDF uses Type3 fonts with glyphs which can't be extracted even with Acrobat Reader.


> PDF file created with LaTeX is bad parsed
> -----------------------------------------
>
>                 Key: PDFBOX-534
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-534
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: Linux/Ubuntu 9
>            Reporter: Ernesto De Santis
>             Fix For: 1.2.0
>
>         Attachments: amapn19_03.pdf, amapn19_03.txt, kvfs-PDFKit.txt, kvfs.pdf, kvfs.txt, kvfs_r944875.txt
>
>
> I'm getting unexpected behavior when parsing a PDF file.
> I'm trying to get the clean body text of a file, and I get a lot of aXX strings, where each X is a digit. It appears to be the character code of the real character, but I don't really know.
> My code is very simple:
>           String[] args = {"/home/ernesto/tesis/documento/kvfs.pdf"};
>           ExtractText.main(args);
> I used PDFBox version 0.8.0-incubator, built on 20/9/2009.
> The output I get is:
> a73a109a112a108a101a109a101a110a116a97a110a100a111 a97a99a99a101a115a111 a97 a115a105a115a116a101a109a97a115 a100a101
> a97a114a99a104a105a118a111a115 a118a105a114a116a117a97a108a101a115 a112a97a114a97 a108a97 a104a101a114a114a97a109a105a101a110a116a97
> a100a101 a98a250a115a113a117a101a100a97 a75a110a101a111a98a97a115a101
> and more ......
> The PDF file was generated by the pdflatex command on Ubuntu 9.
> The pdf properties are:
> producer: pdfTeX-1.40.3
> format: PDF-1.4
> security: NO
> optimized: NO
> paper: A4, vertical (210 x 297 mm)
> When I run the PDFBox test, I get this on the console:
> 0 [main] INFO org.apache.pdfbox.util.PDFStreamEngine  - unsupported/disabled operation: d
> INFO  [main]: unsupported/disabled operation: d
> 7 [main] INFO org.apache.pdfbox.util.PDFStreamEngine  - unsupported/disabled operation: J
> INFO  [main]: unsupported/disabled operation: J
> 7 [main] INFO org.apache.pdfbox.util.PDFStreamEngine  - unsupported/disabled operation: m
> INFO  [main]: unsupported/disabled operation: m
> 7 [main] INFO org.apache.pdfbox.util.PDFStreamEngine  - unsupported/disabled operation: l
> INFO  [main]: unsupported/disabled operation: l
> 7 [main] INFO org.apache.pdfbox.util.PDFStreamEngine  - unsupported/disabled operation: S
> INFO  [main]: unsupported/disabled operation: S
> 272 [main] INFO org.apache.pdfbox.util.PDFStreamEngine  - unsupported/disabled operation: re
> INFO  [main]: unsupported/disabled operation: re
> 272 [main] INFO org.apache.pdfbox.util.PDFStreamEngine  - unsupported/disabled operation: f
> INFO  [main]: unsupported/disabled operation: f
> 1274 [main] INFO org.apache.pdfbox.util.PDFStreamEngine  - unsupported/disabled operation: rg
> INFO  [main]: unsupported/disabled operation: rg
> 1275 [main] INFO org.apache.pdfbox.util.PDFStreamEngine  - unsupported/disabled operation: RG
> INFO  [main]: unsupported/disabled operation: RG
> 1536 [main] INFO org.apache.pdfbox.util.PDFStreamEngine  - unsupported/disabled operation: f*
> INFO  [main]: unsupported/disabled operation: f*

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.