Posted to dev@pdfbox.apache.org by "Wiley Fuller (JIRA)" <ji...@apache.org> on 2010/08/27 02:53:53 UTC

[jira] Commented: (PDFBOX-506) PDFBox can't parse PDF documents from jstor.org

    [ https://issues.apache.org/jira/browse/PDFBOX-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903165#action_12903165 ] 

Wiley Fuller commented on PDFBOX-506:
-------------------------------------

I'm also experiencing this problem.  Looking at the code, it appears that PDFBox assumes there is exactly one xref table per PDF file.  That won't necessarily be the case.
Section 7.5.6 of the PDF spec (1.7) describes incremental updates: each update appends new objects and a corresponding xref table to the end of the file.
The trailer before each subsequent startxref also contains a /Prev entry, which points to the previous xref table.

Incremental updates can also contain new versions of already existing objects.
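
To illustrate the /Prev chain, here is a rough sketch (not PDFBox code) of how a reader might collect every xref table offset, newest first. The class and method names are hypothetical, and it assumes classic "xref ... trailer" tables whose trailer is followed by startxref. It does not handle xref streams, hybrid files, or broken offsets.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class XrefChainSketch {
    private static final Pattern START_XREF = Pattern.compile("startxref\\s+(\\d+)");
    private static final Pattern PREV = Pattern.compile("/Prev\\s+(\\d+)");

    /** Returns xref table offsets, newest first, by following /Prev entries. */
    static List<Long> xrefOffsets(byte[] pdf) {
        // ISO-8859-1 maps each byte to one char, so string indexes equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        List<Long> offsets = new ArrayList<>();

        // The last startxref in the file belongs to the most recent update.
        int last = text.lastIndexOf("startxref");
        if (last < 0) {
            return offsets;
        }
        Matcher m = START_XREF.matcher(text);
        if (!m.find(last)) {
            return offsets;
        }
        long offset = Long.parseLong(m.group(1));

        // Stop on an out-of-range offset or on a loop in the /Prev chain.
        while (offset >= 0 && offset < text.length() && !offsets.contains(offset)) {
            offsets.add(offset);
            int trailer = text.indexOf("trailer", (int) offset);
            if (trailer < 0) {
                break;
            }
            // Assumption: the trailer dictionary ends before the next startxref.
            int end = text.indexOf("startxref", trailer);
            String dict = text.substring(trailer, end < 0 ? text.length() : end);
            Matcher prev = PREV.matcher(dict);
            if (!prev.find()) {
                break; // oldest section: no /Prev entry
            }
            offset = Long.parseLong(prev.group(1));
        }
        return offsets;
    }
}
```

Since updates can redefine existing objects, a parser walking these offsets would presumably let the entry from the newest table win for each object number.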

It looks like fixing this issue will take more than a 5-line patch.


27/08/2010 10:35:58 AM pdftesting.Main main
SEVERE: null
java.io.IOException: Error: Expected to read '%%EOF' instead started reading '1'
        at org.apache.pdfbox.pdfparser.BaseParser.readExpectedString(BaseParser.java:1090)
        at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:463)
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:179)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:859)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:826)
        at org.apache.pdfbox.util.PDFMergerUtility.mergeDocuments(PDFMergerUtility.java:191)
        at pdftesting.Main.main(Main.java:34)

> PDFBox can't parse PDF documents from jstor.org
> -----------------------------------------------
>
>                 Key: PDFBOX-506
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-506
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Dave Engberg
>         Attachments: siegel.pdf
>
>
> The academic repository JStor makes papers available via PDF format.  The PDFs give this origin information:
>   Content creator:  JstorPdfGenerator v1.0
>   PDF producer:  iText 2.0.6 (by lowagie.com)
> These PDFs open fine in Acrobat, Preview, FoxIt, etc., but they throw an exception in PDFBox:
> Exception in thread "main" java.io.IOException: Error: Expected to read '%%EOF' instead started reading '1'
> 	at org.apache.pdfbox.pdfparser.BaseParser.readExpectedString(BaseParser.java:1005)
> 	at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:456)
> 	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
> 	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:739)
> 	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:706)
> 	at org.apache.pdfbox.PDFDebugger.parseDocument(PDFDebugger.java:393)
> 	at org.apache.pdfbox.PDFDebugger.readPDFFile(PDFDebugger.java:369)
> 	at org.apache.pdfbox.PDFDebugger.main(PDFDebugger.java:355)
> I traced through the code, and it appears that PDFBox rejects these because they contain a 'startxref' that is not followed by a %%EOF two lines later:
> ...
> startxref
> 613364
> 1 0 obj
> ...
> Here's a small patch that will accept files that are missing the EOF after the startxref:
> Index: src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java
> ===================================================================
> --- src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java	(revision 802578)
> +++ src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java	(working copy)
> @@ -453,11 +453,9 @@
>              {  
>                  parseStartXref();
>                  //verify that EOF exists 
> -                String eof = readExpectedString( "%%EOF" );
> -                if( eof.indexOf( "%%EOF" )== -1 && !pdfSource.isEOF() )
> -                {
> -                    throw new IOException( "expected='%%EOF' actual='" + eof + "' next=" + readString() +
> -                            " next=" +readString() );
> +                int c = pdfSource.peek();
> +                if (c == '%') {
> +                    readExpectedString("%%EOF");
>                  }
>                  isEndOfFile = true; 
>              }
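
The condition described in the quoted report (a startxref offset that is not followed by %%EOF) can be checked for in a file with a small standalone scan. This is only a diagnostic sketch with hypothetical names; it does not use PDFBox and it treats the file as raw text, so a "startxref" string inside a stream would be counted too.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic, not part of PDFBox.
final class StartXrefCheckSketch {
    // startxref, whitespace, a byte offset, then any trailing whitespace.
    private static final Pattern START_XREF = Pattern.compile("startxref\\s+\\d+\\s*");

    /** Counts startxref entries that are not immediately followed by %%EOF. */
    static int countMissingEof(byte[] pdf) {
        // ISO-8859-1 keeps one char per byte, so binary content cannot break decoding.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Matcher m = START_XREF.matcher(text);
        int missing = 0;
        while (m.find()) {
            // m.end() is the first non-whitespace position after the offset.
            if (!text.startsWith("%%EOF", m.end())) {
                missing++;
            }
        }
        return missing;
    }
}
```

A non-zero count for a file that other viewers open fine would suggest it hits the same strict %%EOF check.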
