Posted to dev@pdfbox.apache.org by "ChrisN (JIRA)" <ji...@apache.org> on 2011/01/20 17:17:44 UTC

[jira] Commented: (PDFBOX-917) Read non-conforming PDFs (attached) without throwing java.io.IOException: expected='endobj' org.apache.pdfbox.io.PushBackInputStream

    [ https://issues.apache.org/jira/browse/PDFBOX-917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12984260#action_12984260 ] 

ChrisN commented on PDFBOX-917:
-------------------------------

The patch works quite well for me. It also fixed the problem for other files. I've voted for it.

> Read non-conforming PDFs (attached) without throwing java.io.IOException: expected='endobj' org.apache.pdfbox.io.PushBackInputStream
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-917
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-917
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 1.3.1
>         Environment: Used through Apache Tika 0.8
>            Reporter: Alex Rodriguez Lopez
>         Attachments: 2010001615.pdf, PDFBOX-917.patch
>
>
> This happened using the following PDF (~2MB): 
> http://biblioteca.sinbad.ua.pt/DisQSws/get.aspx?filename=2010001615.pdf&catalog=Teses&type=pdf
> When reading non-conforming PDFs like the one above, the following exception is thrown and text extraction partially fails:
> WARN - Parsing Error, Skipping Object
> java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.apache.pdfbox.io.PushBackInputStream@53ab04
>         at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:607)
>         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:878)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:843)
>         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
>         at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:218)
>         at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:84) 

-- 
This message is automatically generated by JIRA.