Posted to dev@pdfbox.apache.org by "Adam Nichols (JIRA)" <ji...@apache.org> on 2011/03/16 17:46:29 UTC
[jira] Resolved: (PDFBOX-978) unreading of trailing content after 'endobj' is missing new line byte (fix included)

     [ https://issues.apache.org/jira/browse/PDFBOX-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Nichols resolved PDFBOX-978.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 1.6.0
         Assignee: Adam Nichols

Patch committed in revision 1082195.  This patch is a good, safe fix given the current implementation.

I'd argue that the code should really be reading in one object (i.e. discard leading whitespace, then read until the next whitespace) instead of reading the entire line. Since I don't have time to implement and test that, we'll stick with the current method plus this patch; I don't want to break anything just because I was in a hurry.
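
For illustration only, the alternative described above (discard leading whitespace, then read until whitespace) might look roughly like the sketch below. It is not PDFBox code: the class TokenReaderSketch and the helpers readToken and tokens are hypothetical names, and java.io.PushbackInputStream is assumed as a stand-in for the parser's input source.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class TokenReaderSketch {

    // PDF whitespace bytes: NUL, HT, LF, FF, CR, SP.
    static boolean isWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    // Hypothetical helper (not a PDFBox method): skips leading whitespace,
    // then collects bytes up to the next whitespace byte or end of stream.
    static String readToken(PushbackInputStream in) throws IOException {
        int c = in.read();
        while (c != -1 && isWhitespace(c)) {
            c = in.read();
        }
        StringBuilder token = new StringBuilder();
        while (c != -1 && !isWhitespace(c)) {
            token.append((char) c);
            c = in.read();
        }
        if (c != -1) {
            in.unread(c); // leave the delimiter in the stream for the next read
        }
        return token.toString();
    }

    // Convenience wrapper for the demo: splits a whole string into tokens.
    static List<String> tokens(String data) {
        PushbackInputStream in = new PushbackInputStream(
                new ByteArrayInputStream(data.getBytes(StandardCharsets.ISO_8859_1)));
        List<String> result = new ArrayList<>();
        try {
            for (String t = readToken(in); !t.isEmpty(); t = readToken(in)) {
                result.add(t);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        // "endobj" and "xref" share a line but come back as separate tokens.
        System.out.println(tokens("endobj xref\n0 92\n"));
    }
}
```

Because only the single delimiter byte is pushed back, nothing has to be re-added by hand after the read. A file that glues the keyword to the next number (e.g. "endobj28") would still come back as one token, so the existing startsWith( "endobj" ) handling would remain necessary.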

> unreading of trailing content after 'endobj' is missing new line byte (fix included)
> ------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-978
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-978
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.6.0
>            Reporter: Timo Boehme
>            Assignee: Adam Nichols
>             Fix For: 1.6.0
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> I have several journal PDFs where the last xref section starts like
> endobj xref
> 0 92
> 0000000000 65535 f
> 0000000044 00000 n
> In these cases the PDF parser reads the endobj line completely and unreads " xref".
> However, the newline (in this case ^D) is lost. This is already documented in the
> method readline() within PDFParser:
> "Note: if you later unread the results of this function, you'll
> need to add a newline character to the end of the string."
> Currently I get an error like "expected='obj' actual='655'" because, with the separator gone, 'xref' and the following '0' are read together as 'xref0'.
> The fix:
> In PDFParser, insert the following lines before line 579 (the unreading of trailing characters after 'endobj'):
> // add a space first in place of the newline consumed by readline()
> pdfSource.unread( SPACE_BYTE );
> Thus we get:
>                 if (endObjectKey.startsWith( "endobj" ) ) 
>                 {
>                     /*
>                      * Some PDF files don't contain a new line after endobj so we 
>                      * need to make sure that the next object number is getting read separately
>                      * and not part of the endobj keyword. Ex. Some files would have "endobj28"
>                      * instead of "endobj"
>                      */
>                     // add a space first in place of the newline consumed by readline()
>                     pdfSource.unread( SPACE_BYTE );
>                     pdfSource.unread( endObjectKey.substring( 6 ).getBytes("ISO-8859-1") );
>                 } 
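
The order of the two unread calls in the fix matters: bytes pushed back later are read first, so the space has to be unread before the trailing text. The following is a minimal, self-contained sketch of that behaviour. It assumes java.io.PushbackInputStream as a stand-in for pdfSource; the class UnreadOrderSketch and the method remainingAfterUnread are hypothetical names for this illustration, not part of PDFBox.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class UnreadOrderSketch {

    // Simulates the state after the "endobj" line has been read:
    // 'line' is the text of that line (its newline already consumed),
    // 'rest' is what is still in the stream. Returns what a reader sees
    // once the text after "endobj" has been pushed back.
    static String remainingAfterUnread(String line, String rest, boolean restoreSeparator) {
        try {
            PushbackInputStream in = new PushbackInputStream(
                    new ByteArrayInputStream(rest.getBytes(StandardCharsets.ISO_8859_1)), 64);
            if (restoreSeparator) {
                // Pushed back first, so it is read AFTER the trailing text below.
                in.unread(' ');
            }
            // Everything following the 6 characters of "endobj".
            in.unread(line.substring(6).getBytes(StandardCharsets.ISO_8859_1));

            StringBuilder seen = new StringBuilder();
            for (int c = in.read(); c != -1; c = in.read()) {
                seen.append((char) c);
            }
            return seen.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Without the fix, "xref" runs into the "0" of the next line.
        System.out.println("[" + remainingAfterUnread("endobj xref", "0 92\n", false) + "]");
        // With the fix, a space stands in for the consumed newline.
        System.out.println("[" + remainingAfterUnread("endobj xref", "0 92\n", true) + "]");
    }
}
```

In this sketch the unpatched path yields " xref0 92" and the patched path yields " xref 0 92", which mirrors the 'xref0' symptom described in the report.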
