Posted to dev@pdfbox.apache.org by "Timo Boehme (JIRA)" <ji...@apache.org> on 2011/03/15 09:32:29 UTC

[jira] Created: (PDFBOX-978) unreading of trailing content after 'endobj' is missing new line byte (fix included)

unreading of trailing content after 'endobj' is missing new line byte (fix included)
------------------------------------------------------------------------------------

                 Key: PDFBOX-978
                 URL: https://issues.apache.org/jira/browse/PDFBOX-978
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 1.6.0
            Reporter: Timo Boehme


I have several journal PDFs where the last xref section starts like this:

endobj xref
0 92
0000000000 65535 f
0000000044 00000 n

In these cases the PDF parser reads the endobj line completely and unreads " xref".
However, the newline (in this case ^D) is lost. This is already documented in the
method readline() within PDFParser:
"Note: if you later unread the results of this function, you'll
need to add a newline character to the end of the string."

Currently I get an error like "expected='obj' actual='655'", because the unread " xref" is now directly followed by the next line ("0 92"), so 'xref' is read as 'xref0'.
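
The behaviour can be reproduced outside PDFBox. The following is only a minimal sketch: it uses java.io.PushbackInputStream as a stand-in for pdfSource, and the class LostNewlineDemo and its readLine/readToken helpers are hypothetical simplifications, not the parser's actual methods.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class LostNewlineDemo
{
    /** Hypothetical line reader: returns the line and drops its end-of-line byte. */
    static String readLine( PushbackInputStream in ) throws IOException
    {
        StringBuilder line = new StringBuilder();
        int c;
        while( (c = in.read()) != -1 && c != '\n' && c != '\r' )
        {
            line.append( (char)c );
        }
        return line.toString();
    }

    /** Hypothetical token reader: skips leading whitespace, then reads up to the next whitespace. */
    static String readToken( PushbackInputStream in ) throws IOException
    {
        int c = in.read();
        while( c != -1 && Character.isWhitespace( c ) )
        {
            c = in.read();
        }
        StringBuilder token = new StringBuilder();
        while( c != -1 && !Character.isWhitespace( c ) )
        {
            token.append( (char)c );
            c = in.read();
        }
        return token.toString();
    }

    /** Reads the "endobj" line, unreads its trailing content, and returns the next token. */
    static String tokenAfterEndobj( String fragment )
    {
        byte[] data = fragment.getBytes( StandardCharsets.ISO_8859_1 );
        try( PushbackInputStream in = new PushbackInputStream( new ByteArrayInputStream( data ), 64 ) )
        {
            String endObjectKey = readLine( in );
            // Unread everything after "endobj"; the consumed newline is NOT restored.
            in.unread( endObjectKey.substring( 6 ).getBytes( StandardCharsets.ISO_8859_1 ) );
            return readToken( in );
        }
        catch( IOException e )
        {
            throw new UncheckedIOException( e );
        }
    }

    public static void main( String[] args )
    {
        System.out.println( tokenAfterEndobj( "endobj xref\n0 92\n" ) ); // prints "xref0"
    }
}
```

When the endobj line has no trailing content, nothing is pushed back in front of the next line and the following token is read unchanged.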

The fix:
In PDFParser, insert the following lines before line 579 (the unreading of trailing characters after 'endobj'):

// add a space first in place of the newline consumed by readline()
pdfSource.unread( SPACE_BYTE );

Thus we get:
                if (endObjectKey.startsWith( "endobj" ) ) 
                {
                    /*
                     * Some PDF files don't contain a new line after endobj so we 
                     * need to make sure that the next object number is getting read separately
                     * and not part of the endobj keyword. Ex. Some files would have "endobj28"
                     * instead of "endobj"
                     */
                    // add a space first in place of the newline consumed by readline()
                    pdfSource.unread( SPACE_BYTE );
                    pdfSource.unread( endObjectKey.substring( 6 ).getBytes("ISO-8859-1") );
                } 
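
The order of the two unread calls matters, because pushed-back bytes are returned in reverse order of the calls: the space that is unread first ends up after the trailing content. A small sketch of this, assuming java.io.PushbackInputStream semantics for pdfSource; the class UnreadOrderDemo, the method streamAfterFix and the value of SPACE_BYTE are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class UnreadOrderDemo
{
    // Assumed value for the SPACE_BYTE constant named in the report.
    static final byte SPACE_BYTE = 32;

    /** Applies both unread calls from the fix and returns what the stream yields afterwards. */
    static String streamAfterFix( String trailing, String remaining )
    {
        byte[] rest = remaining.getBytes( StandardCharsets.ISO_8859_1 );
        try( PushbackInputStream pdfSource = new PushbackInputStream( new ByteArrayInputStream( rest ), 64 ) )
        {
            // Unread the replacement for the lost newline first ...
            pdfSource.unread( SPACE_BYTE );
            // ... then the trailing content, which will therefore be read first.
            pdfSource.unread( trailing.getBytes( StandardCharsets.ISO_8859_1 ) );

            StringBuilder out = new StringBuilder();
            int c;
            while( (c = pdfSource.read()) != -1 )
            {
                out.append( (char)c );
            }
            return out.toString();
        }
        catch( IOException e )
        {
            throw new UncheckedIOException( e );
        }
    }

    public static void main( String[] args )
    {
        // prints [ xref 0 92]: "xref" and "0" are separated again
        System.out.println( "[" + streamAfterFix( " xref", "0 92" ) + "]" );
    }
}
```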


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (PDFBOX-978) unreading of trailing content after 'endobj' is missing new line byte (fix included)

Posted by "Adam Nichols (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009197#comment-13009197 ] 

Adam Nichols commented on PDFBOX-978:
-------------------------------------

Fixed in revision 1083858.  Thanks again.

> unreading of trailing content after 'endobj' is missing new line byte (fix included)
> ------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-978
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-978
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.6.0
>            Reporter: Timo Boehme
>            Assignee: Adam Nichols
>             Fix For: 1.6.0
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m

[jira] [Commented] (PDFBOX-978) unreading of trailing content after 'endobj' is missing new line byte (fix included)

Posted by "Timo Boehme (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009055#comment-13009055 ] 

Timo Boehme commented on PDFBOX-978:
------------------------------------

The patch was applied to a different code block than the one I intended. The patched region is OK, but the problem stated in my report persists.
Thus the patch should also be applied two blocks above, within the block starting with
if (endObjectKey.startsWith( "endobj" ) )


[jira] Resolved: (PDFBOX-978) unreading of trailing content after 'endobj' is missing new line byte (fix included)

Posted by "Adam Nichols (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Nichols resolved PDFBOX-978.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 1.6.0
         Assignee: Adam Nichols

Patch committed in revision 1082195.  This patch is a good, safe fix given the current implementation.

I'd argue that the code should really be reading in one object (i.e. discard leading white space, read until whitespace) instead of reading the entire line, but since I don't have time to make and test that, we'll just stick to the current method with this patch.  I don't want to break anything just because I was in a hurry.
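
For illustration only, the token-based reading described above (discard leading white space, read until whitespace) might look roughly like the sketch below. It is not the committed patch and not PDFBox code; the class TokenReaderSketch and its methods are hypothetical, and it would not split a fused keyword such as "endobj28".

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class TokenReaderSketch
{
    /** White-space bytes as defined for PDF syntax: NUL, HT, LF, FF, CR, SP. */
    static boolean isPdfWhitespace( int c )
    {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    /** Skips leading white space, then reads up to the next white space or end of stream. */
    static String readToken( InputStream in ) throws IOException
    {
        int c = in.read();
        while( c != -1 && isPdfWhitespace( c ) )
        {
            c = in.read();
        }
        StringBuilder token = new StringBuilder();
        while( c != -1 && !isPdfWhitespace( c ) )
        {
            token.append( (char)c );
            c = in.read();
        }
        return token.toString();
    }

    /** Splits a fragment into tokens; line breaks and spaces are treated alike. */
    static List<String> tokens( String fragment )
    {
        List<String> result = new ArrayList<String>();
        try( InputStream in = new ByteArrayInputStream( fragment.getBytes( StandardCharsets.ISO_8859_1 ) ) )
        {
            String token = readToken( in );
            while( !token.isEmpty() )
            {
                result.add( token );
                token = readToken( in );
            }
        }
        catch( IOException e )
        {
            throw new UncheckedIOException( e );
        }
        return result;
    }

    public static void main( String[] args )
    {
        System.out.println( tokens( "endobj xref\n0 92\n" ) ); // prints [endobj, xref, 0, 92]
    }
}
```

Because nothing is unread in this approach, there is no consumed newline that has to be restored afterwards.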
