Posted to dev@pdfbox.apache.org by "Andre (Jira)" <ji...@apache.org> on 2022/10/18 09:30:00 UTC

[jira] [Updated] (PDFBOX-5528) PDF/UA: Add marked content sections when flattening acro forms

     [ https://issues.apache.org/jira/browse/PDFBOX-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andre updated PDFBOX-5528:
--------------------------
    Description: 
We need to support PDF/UA-compliant documents to some extent. I noticed that when we take a PDF/UA-compliant PDF document and flatten it via PDAcroForm#flatten, the resulting output is no longer PDF/UA compliant.

After a little research, I found the cause: when flattening, PDFBox writes Do operators that paint the appearance of the form fields into the page content. According to the PDF/UA standard, such content needs to be enclosed in marked content sections (BMC ... EMC or BDC ... EMC; see the attached images).
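
To make the requirement concrete, here is a rough sketch of a check for it, assuming a content stream that has already been decoded to text. It splits on whitespace only, so it ignores string literals and inline images, and it is no substitute for a real PDF/UA validator. The class and method names and the sample streams (including the resource name Fm0) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class MarkedContentCheck {

    /**
     * Returns the 0-based token indexes of Do operators that are not
     * enclosed in any BMC/BDC ... EMC section.
     */
    static List<Integer> unmarkedDoOperators(String contentStream) {
        List<Integer> offenders = new ArrayList<>();
        int depth = 0; // current nesting level of marked content sections
        String[] tokens = contentStream.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            switch (tokens[i]) {
                case "BMC":
                case "BDC":
                    depth++;
                    break;
                case "EMC":
                    if (depth > 0) {
                        depth--;
                    }
                    break;
                case "Do":
                    if (depth == 0) {
                        offenders.add(i);
                    }
                    break;
                default:
                    break;
            }
        }
        return offenders;
    }

    public static void main(String[] args) {
        // Hypothetical output of a flatten step without marked content.
        String flattened = "q 1 0 0 1 50 700 cm /Fm0 Do Q";
        // The same drawing wrapped as an artifact.
        String wrapped = "/Artifact BMC q 1 0 0 1 50 700 cm /Fm0 Do Q EMC";
        System.out.println(unmarkedDoOperators(flattened)); // prints [9]
        System.out.println(unmarkedDoOperators(wrapped));   // prints []
    }
}
```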

By copying some code from PDAcroForm#flatten and adding the contentStream.beginMarkedContent and contentStream.endMarkedContent calls myself, I can work around the problem, but that is less than ideal. It would be great if this could be included in PDFBox.

 
{code:java}
            // Property list for the marked content section; mcid and bBox are
            // defined outside this snippet.
            final var dict = new COSDictionary();
            dict.setLong(COSName.MCID, mcid);
            dict.setItem(COSName.BBOX, bBox);
            dict.setItem(COSName.TYPE, COSName.BACKGROUND);
            final var propList = PDPropertyList.create(dict);

            // Open the marked content section before drawing the flattened field.
            contentStream.beginMarkedContent(COSName.ARTIFACT, propList);
            contentStream.saveGraphicsState();
            // See https://stackoverflow.com/a/54091766/1729265 for an explanation of the
            // steps required. This transforms the appearance stream form object into the
            // rectangle of the annotation bbox and maps the coordinate systems.
            final var transformationMatrix = pdfbox_resolveTransformationMatrix(form, annotation, appearanceStream);
            contentStream.transform(transformationMatrix);
            contentStream.drawForm(fieldObject);
            contentStream.restoreGraphicsState();
            // Close the marked content section.
            contentStream.endMarkedContent();
{code}
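
For reference, the operator sequence that a wrapped, flattened field should produce can be sketched with plain Java and no PDFBox dependency. This is only an illustration under assumptions: it uses the simpler BMC form without a property list (the snippet above uses a property list, which corresponds to BDC), and the class name, method names, resource name Fm0 and matrix values are hypothetical.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class MarkedContentSketch {

    /** Formats a number without exponent and without trailing zeros. */
    static String num(double v) {
        return new BigDecimal(v)
                .setScale(4, RoundingMode.HALF_UP)
                .stripTrailingZeros()
                .toPlainString();
    }

    /**
     * Builds the operators that draw a form XObject inside an
     * /Artifact marked content section.
     *
     * @param xObjectName resource name of the form XObject, without the slash
     * @param m           the six matrix entries a b c d e f for the cm operator
     */
    static String wrapAsArtifact(String xObjectName, double[] m) {
        if (m.length != 6) {
            throw new IllegalArgumentException("matrix needs exactly 6 entries");
        }
        StringBuilder sb = new StringBuilder();
        sb.append("/Artifact BMC\n");  // begin marked content
        sb.append("q\n");              // save graphics state
        for (double v : m) {
            sb.append(num(v)).append(' ');
        }
        sb.append("cm\n");             // apply the transformation matrix
        sb.append('/').append(xObjectName).append(" Do\n"); // paint the XObject
        sb.append("Q\n");              // restore graphics state
        sb.append("EMC\n");            // end marked content
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(wrapAsArtifact("Fm0", new double[] {1, 0, 0, 1, 50, 700}));
    }
}
```

The point of the sketch is the nesting order: the marked content section encloses the whole q ... Q block, so the Do operator is never left outside of it.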
 

> PDF/UA: Add marked content sections when flattening acro forms
> --------------------------------------------------------------
>
>                 Key: PDFBOX-5528
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5528
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: AcroForm
>            Reporter: Andre
>            Priority: Minor
>         Attachments: correct.png, wrong.png
>


