Posted to dev@pdfbox.apache.org by "Adam Nichols (JIRA)" <ji...@apache.org> on 2010/08/21 01:53:24 UTC

[jira] Created: (PDFBOX-796) Objects from streams overwrite objects already read with the same ID/Generation

Objects from streams overwrite objects already read with the same ID/Generation
-------------------------------------------------------------------------------

                 Key: PDFBOX-796
                 URL: https://issues.apache.org/jira/browse/PDFBOX-796
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
         Environment: 32-bit Windows Vista, Java 1.5, PDFBox head tag
            Reporter: Adam Nichols
            Assignee: Adam Nichols
             Fix For: 1.3.0


When trying to merge some documents (using the PDFMergerUtility class) I got a NullPointerException and the merge failed.  Tracing through the code, I eventually discovered that some objects were being overwritten when the PDFParser called document.dereferenceObjectStreams(); (line 207 of PDFParser.java).

Having multiple objects with the same object ID is a violation of the PDF specification, so how this should be dealt with is undefined.  The "use the first object" approach enabled my file to be processed and is consistent with the other code in PDFBox.  For another example of PDFBox handling an object which already exists, see PDFParser (line 541), which checks whether the object has already been read and put in the pool.  If it has, the newly read object is added to the list of conflicts.  Later, when resolveConflicts() is called, it overwrites the object only if it's specifically referenced in the xref table.  This is a reasonable way to resolve conflicts because if the object isn't in the xref table, it is likely the wrong one.
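
The conflict handling described above can be sketched roughly as follows.  This is a minimal illustration, not PDFBox's actual code: ConflictSketch, Parsed, addParsed and the string keys are hypothetical stand-ins, and it assumes "referenced in the xref table" means that the xref entry for the key points at the byte offset the duplicate was read from.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a parser's object pool; not PDFBox's actual classes. */
final class ConflictSketch {

    /** A parsed object together with the byte offset it was read from. */
    static final class Parsed {
        final String key;    // "objectNumber generation", e.g. "12 0"
        final long offset;   // byte offset of the object in the file
        final String value;  // placeholder for the parsed object body

        Parsed(String key, long offset, String value) {
            this.key = key;
            this.offset = offset;
            this.value = value;
        }
    }

    final Map<String, String> pool = new HashMap<String, String>();
    final List<Parsed> conflicts = new ArrayList<Parsed>();

    /** The first object read for a key wins; later duplicates are parked as conflicts. */
    void addParsed(Parsed p) {
        if (!pool.containsKey(p.key)) {
            pool.put(p.key, p.value);
        } else {
            conflicts.add(p);
        }
    }

    /** Overwrite a pooled object only when the xref table points at the duplicate's offset. */
    void resolveConflicts(Map<String, Long> xref) {
        for (Parsed p : conflicts) {
            Long listed = xref.get(p.key);
            if (listed != null && listed.longValue() == p.offset) {
                pool.put(p.key, p.value);
            }
        }
        conflicts.clear();
    }

    public static void main(String[] args) {
        ConflictSketch sketch = new ConflictSketch();
        sketch.addParsed(new Parsed("12 0", 100L, "first copy"));
        sketch.addParsed(new Parsed("12 0", 900L, "second copy"));

        Map<String, Long> xref = new HashMap<String, Long>();
        xref.put("12 0", 900L);  // the xref table names the second copy
        sketch.resolveConflicts(xref);
        System.out.println(sketch.pool.get("12 0"));
    }
}
```

Note that this scheme depends on every duplicate having a byte offset to compare against the xref table.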

Since we're reading from a stream of compressed data, we cannot give a particular byte offset.  This means we can't add these conflicts to the conflict list and use the xref table to determine whether the object is legitimate.  It's best to use the data we've already read, as using the one from the stream has been confirmed to cause problems.  I've done regression testing with other files which have this problem, including the file from PDFBOX-720, and have not seen any issues.

Unfortunately, I cannot provide the PDF which demonstrates this problem and solution, as it contains information I'm not authorized to release.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PDFBOX-796) Objects from streams overwrite objects already read with the same ID/Generation

Posted by "Adam Nichols (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905123#action_12905123 ] 

Adam Nichols commented on PDFBOX-796:
-------------------------------------

I just updated COSDocument.  Previously the check was:
if(objectPool.get(key) == null)
but it should be (and now is):
if(objectPool.get(key) == null || objectPool.get(key).getObject() == null)
Once a reference to an object has been found, an entry for it is already in the object pool, because we know the object exists (or at least that it should exist), even though the object itself may not have been read yet.  With the extra condition, the code fills in such an entry properly instead of skipping it.  This fix ensures that PDFs created with Acrobat Web Capture 9.0 can be processed properly.  Committed in revision 991629.
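
To illustrate why the second condition matters, here is a minimal sketch under stated assumptions.  PoolSketch, Entry, reference and addFromObjectStream are hypothetical names rather than PDFBox's real API; only the shape of the null check mirrors the condition quoted in this comment.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-ins; these are not PDFBox's actual classes. */
final class PoolSketch {

    /** Pool entry: created as soon as a reference is seen, filled in once the object is read. */
    static final class Entry {
        private Object object;  // stays null until the referenced object has been parsed

        Object getObject() {
            return object;
        }

        void setObject(Object o) {
            object = o;
        }
    }

    private final Map<String, Entry> objectPool = new HashMap<String, Entry>();

    /** Called when a reference such as "7 0 R" is encountered: registers a placeholder entry. */
    Entry reference(String key) {
        Entry entry = objectPool.get(key);
        if (entry == null) {
            entry = new Entry();
            objectPool.put(key, entry);
        }
        return entry;
    }

    /** Called for each object unpacked from an object stream; returns true if it was stored. */
    boolean addFromObjectStream(String key, Object parsed) {
        Entry existing = objectPool.get(key);
        // Store when the key is unknown OR only an empty placeholder exists.
        if (existing == null || existing.getObject() == null) {
            reference(key).setObject(parsed);
            return true;
        }
        return false;  // an object with this key was already read, so keep it
    }

    /** Returns the pooled object for the key, or null if nothing has been read yet. */
    Object get(String key) {
        Entry entry = objectPool.get(key);
        return entry == null ? null : entry.getObject();
    }

    public static void main(String[] args) {
        PoolSketch pool = new PoolSketch();
        pool.reference("7 0");  // a reference is seen before the object itself
        System.out.println(pool.addFromObjectStream("7 0", "from stream"));
        System.out.println(pool.addFromObjectStream("7 0", "duplicate"));
        System.out.println(pool.get("7 0"));
    }
}
```

With only the first half of the condition, the placeholder created by reference() would count as "already read" and the object from the stream would never be filled in.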


[jira] Updated: (PDFBOX-796) Objects from streams overwrite objects already read with the same ID/Generation

Posted by "Adam Nichols (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Nichols updated PDFBOX-796:
--------------------------------

    Issue Type: Improvement  (was: Bug)
      Due Date: 27/Aug/10  (was: 20/Aug/10)


[jira] Commented: (PDFBOX-796) Objects from streams overwrite objects already read with the same ID/Generation

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902760#action_12902760 ] 

Andreas Lehmkühler commented on PDFBOX-796:
-------------------------------------------

The patch looks good to me. But in the long run we should add support for incremental updates and signed documents so that this workaround will become redundant.


[jira] Resolved: (PDFBOX-796) Objects from streams overwrite objects already read with the same ID/Generation

Posted by "Adam Nichols (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Nichols resolved PDFBOX-796.
---------------------------------

    Resolution: Fixed

Patch committed in revision 989838.  Even with incremental updates, there shouldn't be two objects with the same key (object ID and generation), so this will probably not change when support for incremental updates is completed.


[jira] Updated: (PDFBOX-796) Objects from streams overwrite objects already read with the same ID/Generation

Posted by "Adam Nichols (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Nichols updated PDFBOX-796:
--------------------------------

    Attachment: PDFBOX-796.patch

Although the change is small, it alters COSDocument, which is a fundamental part of PDFBox, so I'm including the patch for review.  If I do not hear any concerns or objections in the next week, I will commit it to SVN.
