Posted to dev@pdfbox.apache.org by "Tilman Hausherr (Jira)" <ji...@apache.org> on 2020/02/01 12:33:00 UTC

[jira] [Resolved] (PDFBOX-4738) getDocument().getObjects() returns nothing for split result documents

     [ https://issues.apache.org/jira/browse/PDFBOX-4738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr resolved PDFBOX-4738.
-------------------------------------
      Assignee: Tilman Hausherr
    Resolution: Fixed

> getDocument().getObjects() returns nothing for split result documents
> ---------------------------------------------------------------------
>
>                 Key: PDFBOX-4738
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4738
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Documentation
>    Affects Versions: 2.0.18
>            Reporter: Yuguang Huang
>            Assignee: Tilman Hausherr
>            Priority: Minor
>             Fix For: 2.0.19, 3.0.0 PDFBox
>
>
>  
> Hi PDFBox community, we want to get the object count per page rather than for the whole document.
> Our approach is to split the document into multiple single-page documents. However, the split documents appear to contain no objects: getDocument().getObjects() returns an empty list.
> If we save each page to a byte array and then load it back into a PDDocument, we are able to get the object counts.
>  
> Is there any way to get the per-page object count without so much I/O? Thanks!
>  
> Output of the code below for a three-page PDF document:
>  
> Page objects count from splitted pages:
> page [1] num of objs [0]
> page [2] num of objs [0]
> page [3] num of objs [0]
> Page objects count from pages generated from bytes:
> page [1] num of objs [20]
> page [2] num of objs [51]
> page [3] num of objs [20]
>  
> {code:java}
> private static void printNumObjects(String pdfFilename) throws IOException {
>     byte[] fileContent = Files.readAllBytes((new File(pdfFilename)).toPath());
>     PDDocument document = PDDocument.load(fileContent);
>     List<PDDocument> pages = new Splitter().split(document);
>     List<byte[]> pageBytes = pages.stream().map(page -> {
>         try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
>             page.save(baos);
>             page.close();
>             return baos.toByteArray();
>         } catch (IOException e) {
>             LOG.error("Failed to get bytes from page.", e);
>             return new byte[0];
>         }
>     }).collect(Collectors.toList());
>     System.out.println("Page objects count from splitted pages:");
>     IntStream.range(0, pages.size()).forEach(i -> System.out.println(String.format(
>             "page [%d] num of objs [%d]", i + 1, pages.get(i).getDocument().getObjects().size())));
>     System.out.println("Page objects count from pages generated from bytes:");
>     IntStream.range(0, pageBytes.size()).forEach(i -> {
>         try {
>             System.out.println(String.format("page [%d] num of objs [%d]", i + 1,
>                     PDDocument.load(pageBytes.get(i)).getDocument().getObjects().size()));
>         } catch (IOException e) {
>             LOG.error("Failed to load page.", e);
>         }
>     });
> }{code}
>  
>  
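
The report asks whether a per-page object count can be obtained without saving and reloading each page. One possible direction, sketched here only as an illustration, is to walk the object graph starting at the page and count the distinct objects reached. The sketch uses a hypothetical stand-in type, Node, and a hypothetical helper, countReachable; neither is part of PDFBox, and the result would not necessarily match what getDocument().getObjects() reports after a reload.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

public class ReachableObjectCounter {

    /** Hypothetical stand-in for a PDF object; this is NOT a PDFBox class. */
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        /** Adds a reference from this node to child and returns this node. */
        Node link(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Counts the distinct nodes reachable from root, including root itself. */
    static int countReachable(Node root) {
        // Identity-based set: two nodes are the same only if they are the same instance.
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<Node> pending = new ArrayDeque<>();
        pending.push(root);
        while (!pending.isEmpty()) {
            Node current = pending.pop();
            if (!seen.add(current)) {
                continue; // already visited, so cycles terminate
            }
            for (Node child : current.children) {
                pending.push(child);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        Node font = new Node("font");
        Node resources = new Node("resources").link(font);
        Node contents = new Node("contents");
        Node page = new Node("page").link(resources).link(contents);
        font.link(page); // back-reference, to show that cycles are handled
        System.out.println(countReachable(page));
    }
}
```

In a real PDF each page dictionary refers back to its parent page tree node, so an actual traversal would probably need to skip that link to avoid counting the whole document for every page.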



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
