Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2017/11/09 16:55:01 UTC

[jira] [Comment Edited] (PDFBOX-3999) Merge failed to clone tags

    [ https://issues.apache.org/jira/browse/PDFBOX-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245992#comment-16245992 ] 

Tilman Hausherr edited comment on PDFBOX-3999 at 11/9/17 4:54 PM:
------------------------------------------------------------------

1)
Your patch breaks the result document.

The pages of the result document have the object numbers 14..21.

In the result document created with the modified PDFBox version, have a look at
{{Root/StructTreeRoot/ParentTree/Nums/\[85]/Pg}} with PDFDebugger.

That is a page and has the object number 496. I suspect that this is a runaway clone.

Cloning is risky and may need some post-processing, see PDFBOX-3972. And it's not just pages, but all elements.

With the existing code that does not happen, the object number is 18.
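
The following is a simplified, PDFBox-free model of how a runaway clone of this kind could arise. It is only a sketch of the suspected mechanism, not the actual merge code: {{Node}}, {{deepClone}}, {{containsSame}} and {{pageIsInTree}} are hypothetical names, and the assumption is that the structure tree gets cloned without knowing that the page it points to has already been copied into the destination.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

public class RunawayCloneDemo
{
    /** Hypothetical stand-in for a PDF dictionary: a name plus references to other nodes. */
    static class Node
    {
        final String name;
        final List<Node> refs = new ArrayList<>();

        Node(String name)
        {
            this.name = name;
        }
    }

    /** Deep-copies a node graph; the map remembers which source node became which copy. */
    static Node deepClone(Node src, Map<Node, Node> cloned)
    {
        Node existing = cloned.get(src);
        if (existing != null)
        {
            return existing;
        }
        Node copy = new Node(src.name);
        cloned.put(src, copy);
        for (Node ref : src.refs)
        {
            copy.refs.add(deepClone(ref, cloned));
        }
        return copy;
    }

    /** Identity-based membership test, similar in spirit to looking a page up in the page tree. */
    static boolean containsSame(List<Node> pages, Node candidate)
    {
        for (Node page : pages)
        {
            if (page == candidate)
            {
                return true;
            }
        }
        return false;
    }

    /** Returns true if the cloned structure element points to a page of the destination. */
    public static boolean pageIsInTree(boolean shareMap)
    {
        Node srcPage = new Node("page");
        Node srcElem = new Node("structElem");
        srcElem.refs.add(srcPage); // plays the role of the /Pg entry

        List<Node> dstPages = new ArrayList<>();
        Map<Node, Node> pageMap = new IdentityHashMap<>();
        dstPages.add(deepClone(srcPage, pageMap));

        // With a fresh map the cloner does not know that the page was already copied,
        // so it copies the page a second time.
        Map<Node, Node> structMap = shareMap ? pageMap : new IdentityHashMap<Node, Node>();
        Node dstElem = deepClone(srcElem, structMap);
        return containsSame(dstPages, dstElem.refs.get(0));
    }

    public static void main(String[] args)
    {
        System.out.println("shared map:   " + pageIsInTree(true));
        System.out.println("separate map: " + pageIsInTree(false));
    }
}
```

In this model the structure element only ends up pointing at a page of the destination when both clone passes share the same source-to-copy map. With separate maps the page is copied twice, and the second copy is referenced from the structure element but is not in the page list, which is the same symptom as the stray page object 496.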

I'm sorry that I mentioned cloning myself in the SO issue and maybe drove you into a dead end. I don't know whether cloning is the solution or not. I have opened two new issues (PDFBOX-4003 and PDFBOX-4004).

Run this code to check your solution:
{code}
    public static void main(String[] args) throws IOException
    {
        PDFMergerUtility pdfMergerUtility = new PDFMergerUtility();
        PDDocument src = PDDocument.load(new File("GovFormPreFlattened.pdf"));
        PDDocument dst = PDDocument.load(new File("GovFormPreFlattened.pdf"));
        pdfMergerUtility.appendDocument(dst, src);
        src.close(); //if we don't close the src then we don't have an error
        dst.save(new File("GovFormPreFlattened-merged.pdf"));
        dst.close();

        PDDocument doc = PDDocument.load(new File("GovFormPreFlattened-merged.pdf"));
        PDPageTree pageTree = doc.getPages();
        PDNumberTreeNode parentTree = doc.getDocumentCatalog().getStructureTreeRoot().getParentTree();
        COSArray numArray = (COSArray) parentTree.getCOSObject().getDictionaryObject(COSName.NUMS);
        for (COSBase base : numArray)
        {
            if (base instanceof COSObject)
            {
                base = ((COSObject) base).getObject();
            }
            if (base instanceof COSArray)
            {
                for (COSBase base2 : (COSArray) base)
                {
                    if (base2 instanceof COSObject)
                    {
                        base2 = ((COSObject) base2).getObject();
                    }
                    checkForPage(pageTree, base2);
                }
            }
            else if (base instanceof COSDictionary)
            {
                checkForPage(pageTree, base);
            }
        }
        doc.close();
    }

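    private static void checkForPage(PDPageTree pageTree, COSBase base2)
    {
        // entries of a ParentTree array can be null, so only inspect dictionaries
        if (!(base2 instanceof COSDictionary))
        {
            return;
        }
        COSDictionary dict = (COSDictionary) base2;
        if (dict.containsKey(COSName.PG))
        {
            PDPage page = new PDPage((COSDictionary) dict.getDictionaryObject(COSName.PG));
            int pageIndex = pageTree.indexOf(page);
            if (pageIndex < 0)
            {
                // the /Pg entry points to a page that is not in the page tree
                System.out.println("Oh no!");
            }
        }
    }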
{code}


2)
Minor, unrelated - this code:
{code}
+                COSArray srcKArrayClone = new COSArray();
+                for (COSBase next : srcKArray) {
+                    srcKArrayClone.add(cloner.cloneForNewDocument(next));
+                }
{code}
why not just do {{srcKArrayClone = cloner.cloneForNewDocument(srcKArray)}}?


was (Author: tilman):
1)
Your patch breaks the result document.

The pages of the result document have the object numbers 14..21.

In the result document created with the modified PDFBox version, have a look at
{{Root/StructTreeRoot/ParentTree/Nums/\[85]/Pg}} with PDFDebugger.

That is a page and has the object number 496. I suspect that this is a runaway clone.

Cloning is risky and may need some post-processing, see PDFBOX-3972. And it's not just pages, but all elements.

With the existing code that does not happen, the object number is 18.

I'm sorry that I mentioned cloning myself in the SO issue and maybe drove you into a dead end. I don't know whether cloning is the solution or not. I'll open two new issues, including one for my "suspicion is that it is related to not removing the fields from the structure tree when flattening".

> Merge failed to clone tags
> --------------------------
>
>                 Key: PDFBOX-3999
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3999
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 2.0.8
>            Reporter: Dave Hill
>            Priority: Critical
>         Attachments: GovFormPreFlattened.pdf, pdfbox.patch
>
>
> After merging two tagged documents, closing the source document causes the destination document to be closed, which prevents it from being saved. The following code demonstrates the bug with the attached flattened government PDF file. The original is available [here|https://studentloans.gov/myDirectLoan/downloadForm.action?searchType=library&shortName=general&localeCode=en-us] if you need it.
> {code}
> @Test
> public void testMerge() throws Exception {
>     PDFMergerUtility pdfMergerUtility = new PDFMergerUtility();
>     PDDocument src = PDDocument.load(new File("GovFormPreFlattened.pdf"));
>     PDDocument dest = PDDocument.load(new File("GovFormPreFlattened.pdf"));
>     pdfMergerUtility.appendDocument(dest, src);
>     src.close(); //if we don't close the src then we don't have an error
>     dest.save(File.createTempFile("MergeIssue",".PDF"));
>     dest.close();
> }
> {code}
> The issue is resolved with the attached patch.
> Also, I removed the "if (mergeStructTree)" check because mergeStructTree is always true here: this code is already inside an "if (mergeStructTree)" block.


