Posted to dev@pdfbox.apache.org by "Martijn Brinkers (JIRA)" <ji...@apache.org> on 2010/12/10 11:42:02 UTC

[jira] Commented: (PDFBOX-917) Read non-conforming PDFs (attached) without throwing java.io.IOException: expected='endobj' org.apache.pdfbox.io.PushBackInputStream

    [ https://issues.apache.org/jira/browse/PDFBOX-917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970123#action_12970123 ] 

Martijn Brinkers commented on PDFBOX-917:
-----------------------------------------

The problem is caused by the endobj keyword being either missing or corrupt. The current code in trunk that tries to handle this situation seems overly complicated, and it still fails to handle a missing endobj in a lot of cases. I have added a patch that seems to handle non-conforming PDFs with a missing endobj (which happens quite often) better in most cases, judging by a large number of PDF ebooks.
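
To illustrate the general idea of tolerating a missing endobj, here is a minimal sketch. It is not the code from the attached patch and does not use PDFBox classes: the class name LenientEndObj and the method consumeEndObj are hypothetical, and it relies only on java.io.PushbackInputStream from the JDK. After an object body has been parsed, it consumes the endobj keyword if it is there; otherwise it pushes the bytes back so the caller can carry on with the next object instead of throwing.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of PDFBox or the patch.
public class LenientEndObj {
    private static final byte[] ENDOBJ = "endobj".getBytes(StandardCharsets.US_ASCII);

    // PDF whitespace characters: NUL, TAB, LF, FF, CR, SPACE.
    private static boolean isWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    /**
     * Skips leading whitespace, then consumes "endobj" if it is the next token.
     * Returns true if the keyword was found. Otherwise the bytes that were read
     * are pushed back and false is returned, so the caller can treat the object
     * as implicitly closed. The stream's pushback buffer must hold at least
     * 6 bytes. A real parser would also check the delimiter after the keyword.
     */
    static boolean consumeEndObj(PushbackInputStream in) throws IOException {
        int c;
        while ((c = in.read()) != -1 && isWhitespace(c)) {
            // skip whitespace between the object body and the keyword
        }
        if (c == -1) {
            return false; // end of file: endobj is missing
        }
        in.unread(c);

        byte[] buf = new byte[ENDOBJ.length];
        int n = 0;
        while (n < buf.length) {
            int r = in.read(buf, n, buf.length - n);
            if (r == -1) {
                break;
            }
            n += r;
        }
        if (n == ENDOBJ.length && Arrays.equals(buf, ENDOBJ)) {
            return true;
        }
        in.unread(buf, 0, n); // not endobj: leave the data for the next parse step
        return false;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "\n2 0 obj\n".getBytes(StandardCharsets.US_ASCII);
        PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(data), 16);
        if (!consumeEndObj(in)) {
            System.out.println("endobj missing, continuing with the next object");
        }
    }
}
```

Whether to log a warning or silently continue when the keyword is absent is a design choice; the sketch leaves that to the caller.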

> Read non-conforming PDFs (attached) without throwing java.io.IOException: expected='endobj' org.apache.pdfbox.io.PushBackInputStream
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-917
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-917
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 1.3.1
>         Environment: Used through Apache Tika 0.8
>            Reporter: Alex Rodriguez Lopez
>         Attachments: 2010001615.pdf, PDFBOX-917.patch
>
>
> This happened using the following PDF (~2MB): 
> http://biblioteca.sinbad.ua.pt/DisQSws/get.aspx?filename=2010001615.pdf&catalog=Teses&type=pdf
> When reading non-conforming PDFs like the one above the following exception is thrown and the text extraction partially fails:
> WARN - Parsing Error, Skipping Object
> java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.apache.pdfbox.io.PushBackInputStream@53ab04
>         at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:607)
>         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:878)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:843)
>         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
>         at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:218)
>         at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:84) 
