Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2013/12/16 17:09:10 UTC

[jira] [Issue Comment Deleted] (PDFBOX-1769) Fix crash on invalid xref

     [ https://issues.apache.org/jira/browse/PDFBOX-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-1769:
------------------------------------

    Comment: was deleted

(was: There's another problem in that parser.
I get this exception with the file amyuni2_05d__pdf1_3_acro4x.pdf (it used to be part of the project and has since been removed, but it can still be found on the web):
java.io.IOException: Object (48:0) at offset 161333 does not end with 'endobj'.
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:1312)
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:1159)
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseDictObjects(NonSequentialPDFParser.java:1133)
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:470)
    at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:731)
    at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1139)
    at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1122)
    at pdfboxpageimageextraction.ExtractImages.doPdf(ExtractImages.java:134)
    at pdfboxpageimageextraction.ExtractImages.main(ExtractImages.java:78)

The message is correct: the 'endobj' keyword is indeed missing in that file. However, the content of endObjectKey is "49 0 obj", i.e. the start of the next object.

So my suggestion is to change the segment at

{code}
if (!endObjectKey.startsWith("endobj"))
{
      throw new IOException("Object (" + readObjNr + ":" + readObjGen + ") at offset "
                    + offsetOrObjstmObNr + " does not end with 'endobj'.");
}
{code}

to
{code}
if (!endObjectKey.startsWith("endobj"))
{
    if (endObjectKey.endsWith(" obj"))
    {
        // the next object starts right away: tolerate the missing 'endobj'
        LOG.warn("Object (" + readObjNr + ":" + readObjGen + ") at offset "
            + offsetOrObjstmObNr + " does not end with 'endobj' but with '" + endObjectKey + "'");
    }
    else
    {
        throw new IOException("Object (" + readObjNr + ":" + readObjGen + ") at offset "
            + offsetOrObjstmObNr + " does not end with 'endobj' but with '" + endObjectKey + "'");
    }
}
{code})
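
For illustration only, the suggested lenient check can be tried outside the parser with a small stand-alone sketch. The class name EndObjCheck, the method checkEndOfObject and the use of java.util.logging are assumptions made for this example; they are not part of NonSequentialPDFParser, which uses its own LOG and fields.

```java
import java.io.IOException;
import java.util.logging.Logger;

// Hypothetical stand-alone helper, not PDFBox code.
final class EndObjCheck
{
    private static final Logger LOG = Logger.getLogger(EndObjCheck.class.getName());

    /**
     * Returns true if the object is terminated by 'endobj', false if the
     * terminator is missing but the next object header follows directly
     * (tolerated with a warning). Throws for any other token.
     */
    static boolean checkEndOfObject(long objNr, int objGen, long offset, String endObjectKey)
            throws IOException
    {
        if (endObjectKey.startsWith("endobj"))
        {
            return true;
        }
        String message = "Object (" + objNr + ":" + objGen + ") at offset " + offset
                + " does not end with 'endobj' but with '" + endObjectKey + "'";
        if (endObjectKey.endsWith(" obj"))
        {
            // e.g. "49 0 obj": the next object begins, so only warn
            LOG.warning(message);
            return false;
        }
        throw new IOException(message);
    }
}
```

With the values from the stack trace above, checkEndOfObject(48, 0, 161333L, "49 0 obj") would log a warning and return false instead of throwing, while an unrelated token would still raise the IOException.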

> Fix crash on invalid xref
> -------------------------
>
>                 Key: PDFBOX-1769
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1769
>             Project: PDFBox
>          Issue Type: Wish
>          Components: Parsing
>    Affects Versions: 1.8.2
>            Reporter: William Palmer
>            Assignee: Andreas Lehmkühler
>             Fix For: 1.8.4, 2.0.0
>
>
> Need to search for a correct xref start address
> Example file:
> http://digitalcorpora.org/corp/nps/files/govdocs1/020/020747.pdf
> Exception in thread "main" java.io.IOException: Error: Expected an integer type, actual='ref'
> at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1622)
> Using the code:
> PDFTextStripper ts = new PDFTextStripper();
> PrintWriter out = new PrintWriter(new FileWriter(new File(pFile + ".txt")));
> RandomAccess scratchFile = new RandomAccessFile(File.createTempFile("pdfbox-", ".tmp"), "rw");
> PDDocument doc = PDDocument.loadNonSeq(new File(pFile), scratchFile);
> ts.setForceParsing(true);
> ts.writeText(doc, out);
> Related: PDFBOX-1757
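
As a rough sketch of what "search for a correct xref start address" could mean, the following stand-alone example checks whether the keyword xref really sits at the declared offset and, if not, picks the nearest place in the file where it does. This is not the fix applied in PDFBox; the class XrefLocator and its methods are hypothetical names, and the sketch only handles classic xref tables held fully in memory, not cross-reference streams.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration, not PDFBox code.
final class XrefLocator
{
    private static final byte[] XREF = "xref".getBytes(StandardCharsets.ISO_8859_1);

    /** True if the keyword "xref" (and not "startxref") begins exactly at pos. */
    static boolean isXrefAt(byte[] pdf, int pos)
    {
        if (pos < 0 || pos + XREF.length > pdf.length)
        {
            return false;
        }
        for (int i = 0; i < XREF.length; i++)
        {
            if (pdf[pos + i] != XREF[i])
            {
                return false;
            }
        }
        // "startxref" contains "xref" too, so reject a match preceded by "start"
        return !(pos >= 5
                && "start".equals(new String(pdf, pos - 5, 5, StandardCharsets.ISO_8859_1)));
    }

    /**
     * Returns declaredOffset if "xref" starts there, otherwise the offset of
     * the "xref" keyword closest to it, or -1 if the file contains none.
     */
    static int findXrefOffset(byte[] pdf, int declaredOffset)
    {
        if (isXrefAt(pdf, declaredOffset))
        {
            return declaredOffset;
        }
        int best = -1;
        for (int pos = 0; pos + XREF.length <= pdf.length; pos++)
        {
            if (isXrefAt(pdf, pos)
                    && (best < 0 || Math.abs(pos - declaredOffset) < Math.abs(best - declaredOffset)))
            {
                best = pos;
            }
        }
        return best;
    }
}
```

A caller would pass the file bytes and the number found after startxref, then continue parsing from the returned offset when it is not -1.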



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

Re: [jira] [Issue Comment Deleted] (PDFBOX-1769) Fix crash on invalid xref

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

Please don't delete comments if they are referenced by follow-ups.

BR
Andreas Lehmkühler
