Posted to dev@pdfbox.apache.org by "Dave Hill (JIRA)" <ji...@apache.org> on 2018/01/09 21:59:02 UTC

[jira] [Commented] (PDFBOX-4007) Merged documents don't retain tags

    [ https://issues.apache.org/jira/browse/PDFBOX-4007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16319259#comment-16319259 ] 

Dave Hill commented on PDFBOX-4007:
-----------------------------------

I'm trying to find an example of the PDF behind my earlier comment, "When we dig into the output we still find orphaned pages", but I am not able to reproduce it using the current 3.0-SNAPSHOT. When I made that comment, I recall I was using human-readable PDFs; looking through the output, I saw that pages were duplicated but that the duplicates did not tie back to the root object. This is what I meant by "more effectively orphaned". The output I am now getting from the development head (with and without the tag I proposed) fails for a number of different reasons, depending on the combination of test files I try and the order I try them in. I see output with no tags, with mangled tags, and even one case where the tagged page is completely missing.
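
To make "orphaned" concrete, here is a minimal sketch in plain Java. It does not use any PDFBox API; the class name, method name, and the map-of-object-numbers model are assumptions made purely for illustration. It treats a document as a graph of object references and reports every object that cannot be reached from the root, which is the situation described above for the duplicated pages.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of PDFBox. A document is modelled as a map
// from an object number to the object numbers that object references.
final class OrphanedObjectFinder {

    // Returns the object numbers that are present in the map but cannot be
    // reached by following references starting from rootId.
    static Set<Integer> findOrphans(Map<Integer, List<Integer>> references, int rootId) {
        Set<Integer> reachable = new HashSet<>();
        Deque<Integer> pending = new ArrayDeque<>();
        pending.push(rootId);

        // Depth-first walk over the reference graph.
        while (!pending.isEmpty()) {
            int current = pending.pop();
            if (!reachable.add(current)) {
                continue; // already visited, e.g. via a /Parent back-reference
            }
            List<Integer> targets =
                    references.getOrDefault(current, Collections.<Integer>emptyList());
            for (int target : targets) {
                pending.push(target);
            }
        }

        // Whatever was never visited is orphaned with respect to the root.
        Set<Integer> orphans = new TreeSet<>(references.keySet());
        orphans.removeAll(reachable);
        return orphans;
    }
}
```

In this model a duplicated page can still point up at the page tree through its parent reference, yet count as orphaned because nothing reachable from the root points down at it.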

> Merged documents don't retain tags
> ----------------------------------
>
>                 Key: PDFBOX-4007
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4007
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 2.0.8
>            Reporter: Dave Hill
>            Priority: Minor
>              Labels: StructureTree, merge
>         Attachments: HelloWorldTagged.pdf, PDFMergeUtility-2.patch, PDFMergeUtility.patch, Tagged+GeneralForbearance-Merged.pdf, Tagged.pdf
>
>
> Certain combinations of documents don't retain tags when merged. The document [^Tagged.pdf] is just a basic one-word PDF created and tagged with Pro DC. If you try to merge this with the government [General Forbearance form|https://studentloans.gov/myDirectLoan/downloadForm.action?searchType=library&shortName=general&localeCode=en-us], the output crashes DC when you try to view the tags. If you use a flattened version of the General Forbearance form, then the tags are just munged.
> {code}
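>     public static void main(String[] args) throws Exception {
>         PDFMergerUtility pdfMergerUtility = new PDFMergerUtility();
>         // Tagged.pdf is the small tagged source; GeneralForbearance.pdf is the destination.
>         PDDocument src = PDDocument.load(new File("Tagged.pdf"));
>         PDDocument dest = PDDocument.load(new File("GeneralForbearance.pdf"));
>         // Append the pages of src to the end of dest.
>         pdfMergerUtility.appendDocument(dest, src);
>         src.close();
>         // The saved result is the document whose tags are broken.
>         dest.save(new File("BrokenTags.pdf"));
>         dest.close();
>     }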
> {code}
> The included patch appears to make tagging more reliable, but I'm still relying heavily on cloning, which can apparently cause other issues.  The documents I get out with this code seem to present correctly in Adobe readers for all combinations of documents that I tested against.
> My patch is made and tested against yesterday's production head, and it includes my changes from [PDFBOX-3999|https://issues.apache.org/jira/browse/PDFBOX-3999] since it is in the exact same place in the code.
> This is a blocker for Section 508 compliance of merged documents, but I guessed it to be more of a minor issue in the overall scheme of things; please correct me if I am mistaken.
> Thanks!


