Posted to users@pdfbox.apache.org by "Henning, Klaus" <KH...@eitco.de> on 2015/03/12 11:54:24 UTC

how to create structure for an existing PDF document

Hi,

we want to add structure to an existing PDF document. We have PDF documents from a scanner which contain images but no structure.
We want to implement a program that creates the structure so that we can add alternate descriptions to the images based on Tesseract OCR recognition.

Our first approach creates a structure, but the structure seems to be incomplete when we check it with Adobe Acrobat. We can't find any hints in the PDFBox examples
or documentation on how to do this.

Our code snippet:

             try {
                    PDDocument document = PDDocument.load("test.pdf");
                    PDDocumentCatalog documentCatalog = document.getDocumentCatalog();

                    PDStructureTreeRoot treeRoot = document.getDocumentCatalog().getStructureTreeRoot();

                    if(treeRoot == null){
                           COSDictionary cosDictionary = documentCatalog.getCOSDictionary();
                           PDStructureTreeRoot newTreeRoot = new PDStructureTreeRoot();

                           //iterate over pages
                           List<?> pages = documentCatalog.getAllPages();
                           for (Object object : pages) {
                                  PDPage page = (PDPage) object;
                                  Map<String,PDXObject> mapObjects = page.getResources().getXObjects();
                                  for (PDXObject pdxObject : mapObjects.values()) {
                                        if(pdxObject instanceof PDXObjectImage){
                                               PDXObjectImage objectImage = (PDXObjectImage)pdxObject;
                                               //new structure element for the image
                                               PDStructureElement structureElement = new PDStructureElement(StandardStructureTypes.Figure,newTreeRoot);
                                               PDMarkedContent markedContent = new PDMarkedContent(COSName.IMAGE,  new COSDictionary());
                                               markedContent.addXObject(objectImage);
                                               structureElement.appendKid(markedContent);
                                               structureElement.setAlternateDescription("NEW ALTERNATE DESCRIPTION");
                                               newTreeRoot.appendKid(structureElement);
                                        }
                                  }
                           }

                           documentCatalog.setStructureTreeRoot(newTreeRoot);
                           treeRoot = documentCatalog.getStructureTreeRoot();
                    }

                    document.save("testWithTree.pdf");
                    document.close();
             }
             catch (IOException e) {
                    e.printStackTrace();
             }
             catch (COSVisitorException e) {
                    e.printStackTrace();
             }

Can someone help us here?

Best regards,

Klaus Henning


______________________________________________________________________
This email has been scanned by the Symantec Email Security.cloud service.
For more information please visit http://www.symanteccloud.com
______________________________________________________________________

Re: how to create structure for an existing PDF document

Posted by Olaf Drümmer <ol...@callassoftware.com>.
Hi Klaus,

If you are creating tagged PDF for accessibility purposes, and your PDFs consist of one image per page, the use of alternate descriptions is questionable (unless each scanned page is actually perceived as an image, and not as a scanned image of text). Where the images of scanned pages consist of text, it is more suitable to run OCR and place an invisible text layer onto each page. The text in this invisible text layer would then be organised into a suitable tagging structure.
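
To illustrate what an invisible text layer means at the content stream level: text rendering mode 3 (the operator "3 Tr") draws text that is neither filled nor stroked, so it can be searched and selected but is not visible. The following is only a minimal sketch in plain Java without any PDFBox calls. It builds the operator string for a single text run. The class and method names, the font resource name F1 and the coordinates are assumptions made for illustration. A real implementation would take positions and font size from the OCR result, register the font in the page resources, and append the operators to the page's content stream. A literal string like this is only adequate for text that the font's encoding can represent.

```java
import java.util.Locale;

/** Hypothetical helper: builds content-stream operators for one invisible text run. */
final class InvisibleTextSketch {

    /** Escapes the characters that are special inside a PDF literal string. */
    static String escapePdfString(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c == '\\' || c == '(' || c == ')') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /**
     * Returns the operators for one text run at position (x, y).
     * fontResource is the name of a font assumed to exist in the page resources.
     */
    static String invisibleTextOps(String fontResource, double fontSize,
                                   double x, double y, String text) {
        return String.format(Locale.ROOT,
                "BT\n"                       // begin text object
                + "/%s %.2f Tf\n"            // select font and size
                + "3 Tr\n"                   // rendering mode 3: invisible
                + "1 0 0 1 %.2f %.2f Tm\n"   // text matrix: translate to (x, y)
                + "(%s) Tj\n"                // show the recognised text
                + "ET\n",                    // end text object
                fontResource, fontSize, x, y, escapePdfString(text));
    }

    public static void main(String[] args) {
        // Example values only; real values would come from the OCR engine.
        System.out.print(invisibleTextOps("F1", 10, 72, 700, "Invoice (copy)"));
    }
}
```

The text placed this way is what would later be wrapped in marked content and referenced from the tagging structure.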

The standard applicable to accessible tagged PDF is PDF/UA (ISO 14289-1). The PDF Association's website (www.pdfa.org) offers an introductory booklet (as a free PDF download). If you are interested in the more technical details, it is worthwhile looking at the Matterhorn Protocol, which explains in detail the rules an accessible PDF has to comply with if it aims to conform to the PDF/UA standard.

Olaf

PS: I have no clue about the PDFBox API, so I cannot be of help in this regard.


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Re: how to create structure for an existing PDF document

Posted by "Henning, Klaus" <KH...@eitco.de>.
Hi Olaf,

I want to create structure in the sense of a tagged PDF, so that I can add or modify the alternate description of all images in a PDF document.
The only way I found to add or modify the alternate description of an image is via its structure element:

structureElement.setAlternateDescription("NEW ALTERNATE DESCRIPTION");

As described in my previous mail, I want to recognize the text in the image via OCR and write the returned text to the alternate description of the image.
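
For reference, this is roughly what a tagged image looks like at the PDF object level according to the logical structure rules of ISO 32000-1: the drawing operator in the page's content stream is wrapped in a marked-content sequence that carries an MCID, and a Figure structure element points back to that MCID through its /K entry and holds the /Alt text. The sketch below is plain Java without PDFBox calls and only prints the two fragments. The class and method names, the resource name Im0, the object references and the page size are made-up values for illustration, and it is an assumption that missing links of this kind are what Acrobat reports as incomplete.

```java
import java.util.Locale;

/** Hypothetical helper: prints the PDF fragments that tie an image to a Figure element. */
final class TaggedFigureSketch {

    /** Escapes the characters that are special inside a PDF literal string. */
    static String escapePdfString(String text) {
        return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)");
    }

    /** Wraps the image drawing operator in a marked-content sequence with an MCID. */
    static String taggedImageOps(int mcid, String imageResource,
                                 double width, double height, double x, double y) {
        return String.format(Locale.ROOT,
                "/Figure <</MCID %d>> BDC\n"      // begin marked content, tag Figure
                + "q\n"                           // save graphics state
                + "%.2f 0 0 %.2f %.2f %.2f cm\n"  // scale and position the image
                + "/%s Do\n"                      // paint the image XObject
                + "Q\n"                           // restore graphics state
                + "EMC\n",                        // end marked content
                mcid, width, height, x, y, imageResource);
    }

    /**
     * Structure element dictionary in PDF syntax. parentRef and pageRef are
     * indirect references such as "10 0 R" (example values, not real objects).
     */
    static String figureStructElem(int mcid, String parentRef, String pageRef, String altText) {
        return "<< /Type /StructElem /S /Figure /P " + parentRef
                + " /Pg " + pageRef
                + " /K " + mcid
                + " /Alt (" + escapePdfString(altText) + ") >>";
    }

    public static void main(String[] args) {
        System.out.print(taggedImageOps(0, "Im0", 595, 842, 0, 0));
        System.out.println(figureStructElem(0, "10 0 R", "3 0 R", "Scanned page 1"));
    }
}
```

A tagged PDF also carries /MarkInfo << /Marked true >> in the catalog, a /ParentTree in the structure tree root, and a /StructParents entry on each page whose content is tagged.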


Klaus




Re: how to create structure for an existing PDF document

Posted by Olaf Drümmer <ol...@callassoftware.com>.
Hi Klaus,

what kind of structure do you wish to create? Structure in the sense of tagged PDF, or just some logical structure, and if so, for what purposes?

Olaf

