Posted to dev@pdfbox.apache.org by "Adam Nichols (JIRA)" <ji...@apache.org> on 2010/05/05 00:56:17 UTC

[jira] Created: (PDFBOX-720) Inconsistency in parsing PDFs between Windows and Linux

Inconsistency in parsing PDFs between Windows and Linux
-------------------------------------------------------

Key: PDFBOX-720
URL: https://issues.apache.org/jira/browse/PDFBOX-720
Project: PDFBox
Issue Type: Bug
Components: Parsing
Environment: Windows Vista 32-bit, Sun JDK 1.5.0_06, PDFBox HEAD tag (revision 941073)
vs.
Red Hat Linux, 2.6.9-67.ELsmp kernel, Java 1.5.0_06, PDFBox HEAD tag (revision 941073)
Reporter: Adam Nichols
Fix For: 1.2.0

Run this same code using the same PDF and you'll get different results on Linux than on Windows. Regardless of which one you consider "correct", it should be consistent.

    doc = PDDocument.load(inputFile);
    PDDocumentOutline outline = doc.getDocumentCatalog().getDocumentOutline();
    if(outline == null)
        System.out.println("Document outline was null");
    else
        System.out.println("Document outline was not null");

Some interesting notes about this PDF: it looks as if Acrobat Distiller 8.1.0 basically just concatenated two PDFs into one. There are two trailers, and both refer to object "1600 0" as the root. Object 1600 0 is defined more than once: one definition has no "Outlines" entry in its dictionary, while the other has "Outlines 1667 0". Windows picks up the latter and shows the outline correctly; Linux picks up the former and thus returns null for the outline. I tried debugging through PDFParser and BaseParser, but I'm not really sure how that code works and I quickly got lost.
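
To check whether another file shows the same pattern, a rough sketch like the following can count how often an object header such as "1600 0 obj" occurs in the raw bytes. It does not use PDFBox at all. The class and method names are made up for illustration, and the byte search is deliberately naive: it can be fooled by matching bytes inside stream data.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox.
final class ObjectDefinitionCounter {

    /** Counts occurrences of "<objNum> <genNum> obj" in the raw file bytes. */
    static int countDefinitions(byte[] pdf, int objNum, int genNum) {
        byte[] needle = (objNum + " " + genNum + " obj")
                .getBytes(StandardCharsets.ISO_8859_1);
        int count = 0;
        for (int i = 0; i + needle.length <= pdf.length; i++) {
            // Skip matches that are the tail of a longer number, e.g. "11600 0 obj".
            if (i > 0 && pdf[i - 1] >= '0' && pdf[i - 1] <= '9') {
                continue;
            }
            boolean match = true;
            for (int j = 0; j < needle.length; j++) {
                if (pdf[i + j] != needle[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: ObjectDefinitionCounter <file> <objNum> <genNum>");
            return;
        }
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(countDefinitions(
                pdf, Integer.parseInt(args[1]), Integer.parseInt(args[2])));
    }
}
```

A count greater than one for the root object suggests the file carries more than one revision of the catalog, which is the situation described above.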

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PDFBOX-720) Inconsistency in parsing PDFs between Windows and Linux

Posted by "Adam Nichols (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Nichols updated PDFBOX-720:
--------------------------------

    Attachment: 238_Page_Report.pdf

A PDF which demonstrates the problem


[jira] Commented: (PDFBOX-720) Inconsistency in parsing PDFs between Windows and Linux

Posted by "David Hedley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879709#action_12879709 ] 

David Hedley commented on PDFBOX-720:
-------------------------------------


Fixing it properly is probably going to be a big job - I would suggest that the whole parsing module needs to be rewritten in accordance with the PDF spec, e.g. start from the end of the PDF file, read the startxref offset (which points to the most recent cross-reference table or stream) and proceed from there, following the /Prev entries, to build up the xref tables etc.
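
As a rough illustration of the first step of that approach, the sketch below finds the last "startxref" keyword in a file and returns the offset written after it. It is independent of PDFBox; the class and method names are hypothetical. For brevity it decodes the whole file into a string, whereas a real parser would only read the tail of the file.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
final class StartXrefLocator {

    private static final String KEYWORD = "startxref";

    /**
     * Returns the byte offset recorded after the last "startxref" keyword,
     * or -1 if the keyword or the number following it is missing.
     */
    static long findLastStartXref(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so indices line up.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int keywordPos = text.lastIndexOf(KEYWORD);
        if (keywordPos < 0) {
            return -1;
        }
        int i = keywordPos + KEYWORD.length();
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < text.length() && text.charAt(i) >= '0' && text.charAt(i) <= '9') {
            i++;
        }
        if (i == start) {
            return -1; // keyword present but no digits after it
        }
        return Long.parseLong(text.substring(start, i));
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.5\n...\nstartxref\n116\n%%EOF\n...\nstartxref\n90210\n%%EOF\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(findLastStartXref(sample));
    }
}
```

In an incrementally updated file each revision appends its own startxref, so taking the last one selects the newest cross-reference section; older sections are then reached through /Prev rather than by scanning the file.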

However, you can do a quick fix based on the assumption that the "correct" xref table to use will generally be towards the end of the file:

1) Change objectPool in COSDocument to be a LinkedHashMap
2) Change parseXrefStreams to the following:
    public void parseXrefStreams() throws IOException
    {
        COSDictionary trailerDict = new COSDictionary();
        COSObject lastObject = null;
        // Parse every XRef stream, remembering the last one encountered.
        for( COSObject xrefStream : getObjectsByType( "XRef" ) )
        {
            lastObject = xrefStream;
            COSStream stream = (COSStream)xrefStream.getObject();
            PDFXrefStreamParser parser = new PDFXrefStreamParser(stream, this);
            parser.parse();
        }
        // Use only the last XRef stream's dictionary as the trailer; the
        // null check covers files that contain no XRef streams at all.
        if( lastObject != null )
        {
            trailerDict.addAll((COSStream)lastObject.getObject());
        }
        setTrailer( trailerDict );
    }

This will at least consistently choose the last xref table (which should in general be the right one). However, it then exposes another problem, this time in PDFXrefStreamParser.
With some simple debugging output added, it appears that PDFXrefStreamParser::parse is reading garbage in the while loop (line 104 onwards). This has the knock-on effect that objects which have been replaced are not handled properly.
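
The quick fix relies on one property of LinkedHashMap: it iterates in insertion order, so the last element seen in a loop is the last one that was added. The small sketch below shows that property in isolation. The class name and the sample keys are made up, and it assumes (as the quick fix does) that objects are added to the pool in the order they appear in the file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo class, not part of PDFBox.
final class LastInsertedDemo {

    /** Returns the value visited last when iterating the map, or null if it is empty. */
    static <K, V> V lastValue(Map<K, V> map) {
        V last = null;
        for (V value : map.values()) {
            last = value;
        }
        return last;
    }

    public static void main(String[] args) {
        // Keys stand in for object numbers; values for the XRef streams found.
        Map<String, String> pool = new LinkedHashMap<String, String>();
        pool.put("7 0", "xref stream of the original file");
        pool.put("3 0", "xref stream of the first update");
        pool.put("9 0", "xref stream of the second update");
        System.out.println(lastValue(pool));
    }
}
```

With a plain HashMap the same loop would still return some element, but which one is not tied to the order of insertion, so "last" would have no useful meaning.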

I will continue to investigate when I have the time.



[jira] Commented: (PDFBOX-720) Inconsistency in parsing PDFs between Windows and Linux

Posted by "David Hedley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878884#action_12878884 ] 

David Hedley commented on PDFBOX-720:
-------------------------------------

This is a more general issue of incorrect parsing of incrementally updated PDFs, which I am also experiencing. Is anyone currently investigating this?


[jira] Commented: (PDFBOX-720) Inconsistency in parsing PDFs between Windows and Linux

Posted by "David Hedley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878913#action_12878913 ] 

David Hedley commented on PDFBOX-720:
-------------------------------------


Having just looked at the code for handling cross-reference streams, I'm not surprised it's failing at random - the logic in PDFParser for handling these objects is somewhat broken.

Rather than following the PDF spec, it simply merges together all the XRef objects in the PDF file (of which there will be several for files which have been incrementally updated). The order in which they are merged depends on the iteration order of HashMap on the host system, which is not guaranteed, so I'm not surprised you're seeing different results on Linux and Windows.
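
To make the order dependence concrete, here is a small sketch that models each cross-reference section as a map from object key to byte offset and merges them so that later sections overwrite earlier ones. Everything in it is illustrative: the class name, the map-based model and the offsets are invented, and it is not the PDFParser code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class, not part of PDFBox.
final class XrefMergeOrderDemo {

    /** Merges sections in the given order; later sections overwrite earlier entries. */
    static Map<String, Long> merge(List<Map<String, Long>> sections) {
        Map<String, Long> merged = new HashMap<String, Long>();
        for (Map<String, Long> section : sections) {
            merged.putAll(section);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up offsets: the original catalog without Outlines, and the
        // updated catalog that has an Outlines entry.
        Map<String, Long> original = new HashMap<String, Long>();
        original.put("1600 0", 116L);
        Map<String, Long> update = new HashMap<String, Long>();
        update.put("1600 0", 90210L);

        // File order: the updated catalog wins.
        System.out.println(merge(Arrays.asList(original, update)).get("1600 0"));
        // Reversed order: the stale catalog wins.
        System.out.println(merge(Arrays.asList(update, original)).get("1600 0"));
    }
}
```

The two calls differ only in the order of the sections, yet they resolve object 1600 0 to different offsets. If that order comes from iterating a HashMap, either outcome is possible.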



[jira] Updated: (PDFBOX-720) Inconsistency in parsing PDFs between Windows and Linux

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated PDFBOX-720:
---------------------------------

    Fix Version/s:     (was: 1.2.0)


[jira] Commented: (PDFBOX-720) Inconsistency in parsing PDFs between Windows and Linux

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879533#action_12879533 ] 

Andreas Lehmkühler commented on PDFBOX-720:
-------------------------------------------

@David
Sounds reasonable. Any suggestions on how to solve that issue?
