Posted to dev@pdfbox.apache.org by "Neil McErlean (JIRA)" <ji...@apache.org> on 2010/06/29 15:37:54 UTC

[jira] Issue Comment Edited: (PDFBOX-569) Text-Extraction of PDF fails

    [ https://issues.apache.org/jira/browse/PDFBOX-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883545#action_12883545 ] 

Neil McErlean edited comment on PDFBOX-569 at 6/29/10 9:37 AM:
---------------------------------------------------------------

[Issue still appears on 1.2.0-SNAPSHOT by the way]

I think this PDF does not comply with the PDF standard. Correct me if I'm wrong.

I looked at object "151 0":
151 0 obj
<< 
/Type /Page 
/Parent 318 0 R 
/Contents 152 0 R 
/MediaBox [ 0 0 420 596 ] 
/TrimBox [ 0 0 420 596 ] 
/CropBox [ 0 0 420 596 ] 
/Resources << /ProcSet [ /PDF /Text ] /XObject << /Fm418 190 0 R /Fm419 191 0 R /Fm420 192 0 R >> 
/Font << /F354 196 0 R /F421 200 0 R /F579 204 0 R >> >> 
/Rotate 0 
>> 
endobj

Ignoring the potential extra ">>", there are three Font object references in there. If you follow the links you'll find that "202 0" and "204 0" are each defined *twice* in the document: once with /Type /Font and once with /Type /FontDescriptor. This violates the standard (ISO 32000-1, section 7.3.10), which says that an object number and generation number together should uniquely identify an indirect object.
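
To check a file for this kind of duplicate definition, something like the following rough sketch could be used. It is not PDFBox code: DuplicateObjectScan and findDuplicates are made-up names, and it assumes the PDF has been read into a String and that every object header ("N G obj") starts on its own line. A regex scan can also match bytes inside a stream, and a PDF saved with incremental updates can legitimately contain a newer copy of an object, so a hit is a hint to look closer, not proof of a broken file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of PDFBox.
class DuplicateObjectScan {

    // Matches an indirect object header such as "204 0 obj" at the start of a line.
    private static final Pattern OBJ_HEADER =
            Pattern.compile("(?m)^\\s*(\\d+)\\s+(\\d+)\\s+obj\\b");

    // Returns the "number generation" keys that are defined more than once,
    // in order of first appearance.
    static List<String> findDuplicates(String pdfText) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = OBJ_HEADER.matcher(pdfText);
        while (m.find()) {
            counts.merge(m.group(1) + " " + m.group(2), 1, Integer::sum);
        }
        List<String> duplicates = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 1) {
                duplicates.add(e.getKey());
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Made-up sample text, not taken from the attached PDF.
        String sample = String.join("\n",
                "196 0 obj", "<< /Type /Font >>", "endobj",
                "204 0 obj", "<< /Type /FontDescriptor >>", "endobj",
                "204 0 obj", "<< /Type /Font >>", "endobj");
        System.out.println(findDuplicates(sample)); // prints [204 0]
    }
}
```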

So when PDFTextStripper processes the page and PDResources.getFonts() walks the font references, it happens to find the FontDescriptor object for "204 0" before the Font object and passes it to PDFontFactory.createFont, which throws the exception.

I see two possible fixes:
1. Add a type check and skip entries in the font resources whose /Type is not /Font.
2. Throw an exception when we discover a PDF document with non-unique object ids.

I'd favour [2], but I wonder how many PDF documents are actually compliant.
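
For comparison, option [1] could look roughly like the sketch below. It is only an illustration of the idea, not a patch: plain Java maps stand in for the PDF dictionaries, and FontTypeFilter and keepFonts are made-up names. Where such a check would actually belong in PDFBox (PDResources.getFonts() or PDFontFactory.createFont) is an open question.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a type check; plain maps model PDF dictionaries here.
class FontTypeFilter {

    // Keeps only the resource entries whose dictionary declares /Type /Font.
    static Map<String, Map<String, String>> keepFonts(
            Map<String, Map<String, String>> fontResources) {
        Map<String, Map<String, String>> fonts = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> e : fontResources.entrySet()) {
            if ("Font".equals(e.getValue().get("Type"))) {
                fonts.put(e.getKey(), e.getValue());
            }
            // Anything else (for example a FontDescriptor) is skipped instead of
            // causing an exception.
        }
        return fonts;
    }

    public static void main(String[] args) {
        // Made-up resource entries modelled on the page object above.
        Map<String, Map<String, String>> resources = new LinkedHashMap<>();
        resources.put("F354", Map.of("Type", "Font"));
        resources.put("F579", Map.of("Type", "FontDescriptor"));
        System.out.println(keepFonts(resources).keySet()); // prints [F354]
    }
}
```

The trade-off is that text drawn with a skipped font would presumably be lost without any error, which is part of why [2] is tempting.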

> Text-Extraction of PDF fails
> ----------------------------
>
>                 Key: PDFBOX-569
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-569
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>         Environment: 1.6.0_11
>            Reporter: Stephan Götter
>            Priority: Blocker
>         Attachments: b820GL0204.pdf
>
>
> Using trunk, this exception occurs when extracting text from the attached PDF.
> [WARN] PDFParser - invalid xref line: 0
> java.io.IOException: Cannot create font if /Type is not /Font.  Actual=COSName{FontDescriptor}
> 	at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:95)
> 	at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:68)
> 	at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:117)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:206)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:188)
> 	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
> 	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
> 	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)
