Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2017/12/23 13:52:00 UTC

[jira] [Comment Edited] (PDFBOX-4007) Merged documents don't retain tags

    [ https://issues.apache.org/jira/browse/PDFBOX-4007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282028#comment-16282028 ] 

Tilman Hausherr edited comment on PDFBOX-4007 at 12/23/17 1:51 PM:
-------------------------------------------------------------------

"When we dig into the output we still find orphaned pages" - could you attach the source and result files and explain which orphan you found? (Preferably with an address in PDFDebugger.)

Please edit your comment from November 27 to format the code. On the website, select the code, then click on the "+" and choose "code". From mail, just put a "\{code\}" line (without the quotes and without the backslash) before and after the code. If you think the comment was wrong, you can also edit or delete it; it is usually OK to "rewrite history".

The HelloWorldTagged.pdf file has at least one error: ParentTreeNextKey should be 1, not 50. I also think that the appropriate property dictionary is missing in the page resources dictionary.
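
As a rough illustration of the ParentTreeNextKey remark, here is a minimal sketch in plain Java with no PDFBox calls. It models the keys of the parent tree as a list of integers and computes the smallest value that is greater than every key. It assumes that the parent tree of the file holds a single entry with key 0; that assumption matches the expected value of 1 but has not been checked against the attachment. The class and method names are hypothetical. In PDFBox the stored value can be read with PDStructureTreeRoot.getParentTreeNextKey().

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class ParentTreeKeyCheck {

    // Smallest integer greater than every key in the parent tree (0 if the tree is empty).
    public static int expectedNextKey(Collection<Integer> parentTreeKeys) {
        int max = -1;
        for (int key : parentTreeKeys) {
            max = Math.max(max, key);
        }
        return max + 1;
    }

    // True if the declared ParentTreeNextKey equals that smallest value.
    public static boolean isTight(Collection<Integer> parentTreeKeys, int declaredNextKey) {
        return declaredNextKey == expectedNextKey(parentTreeKeys);
    }

    public static void main(String[] args) {
        // Assumption: a single parent tree entry with key 0.
        List<Integer> keys = Arrays.asList(0);
        System.out.println(expectedNextKey(keys)); // prints 1
        System.out.println(isTight(keys, 50));     // prints false
    }
}
```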

I'll make a commit today or tomorrow that does a deeper search for orphan pages.


> Merged documents don't retain tags
> ----------------------------------
>
>                 Key: PDFBOX-4007
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4007
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 2.0.8
>            Reporter: Dave Hill
>            Priority: Minor
>              Labels: StructureTree, merge
>         Attachments: HelloWorldTagged.pdf, PDFMergeUtility-2.patch, PDFMergeUtility.patch, Tagged+GeneralForbearance-Merged.pdf, Tagged.pdf
>
>
> Certain combinations of documents don't retain tags when merged. The document [^Tagged.pdf] is just a basic one-word PDF created and tagged with Pro DC. If you try to merge this with the government [General Forbearance form|https://studentloans.gov/myDirectLoan/downloadForm.action?searchType=library&shortName=general&localeCode=en-us], the output crashes DC when you try to view the tags. If you use a flattened version of the General Forbearance form, then the tags are just munged.
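> {code}
> import java.io.File;
> import org.apache.pdfbox.multipdf.PDFMergerUtility;
> import org.apache.pdfbox.pdmodel.PDDocument;
>
> // The class name is arbitrary; it only makes the snippet compilable.
> public class MergeTagged {
>     public static void main(String[] args) throws Exception {
>         PDFMergerUtility pdfMergerUtility = new PDFMergerUtility();
>         PDDocument src = PDDocument.load(new File("Tagged.pdf"));
>         PDDocument dest = PDDocument.load(new File("GeneralForbearance.pdf"));
>         // Append the tagged one-word document to the form.
>         pdfMergerUtility.appendDocument(dest, src);
>         src.close();
>         // Viewing the tags of this output crashes DC.
>         dest.save(new File("BrokenTags.pdf"));
>         dest.close();
>     }
> }
> {code}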
> The included patch appears to make tagging more reliable, but I'm still relying heavily on cloning, which can apparently cause other issues. The documents I get out with this code seem to present correctly in Adobe readers for all combinations of documents that I tested against.
> My patch is made and tested against yesterday's production head, and it includes my changes from [PDFBOX-3999|https://issues.apache.org/jira/browse/PDFBOX-3999] since it is in the exact same place in the code.
> This is a blocker for 508 compliance of merged documents, but I guessed it to be more of a minor issue in the overall scheme of things; please correct me if I am mistaken.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
