Posted to dev@pdfbox.apache.org by "Yuguang Huang (Jira)" <ji...@apache.org> on 2020/01/09 19:29:00 UTC

[jira] [Updated] (PDFBOX-4738) Pages do not have objects after splitting

     [ https://issues.apache.org/jira/browse/PDFBOX-4738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuguang Huang updated PDFBOX-4738:
----------------------------------
    Description: 
 

Hi PDFBox community, we want to get the object count for each page rather than for the whole document.

Our approach is to split the document into single-page documents with Splitter. However, the resulting documents appear to contain no objects: getDocument().getObjects() returns an empty list.

If we instead save each page to a byte array and load it back into a PDDocument, we do get the object counts.

Is there a way to get the per-page object count without this much I/O? Thanks!

Output of the code below for a three-page PDF document:

 

Page objects count from splitted pages:
page [1] num of objs [0]
page [2] num of objs [0]
page [3] num of objs [0]
Page objects count from pages generated from bytes:
page [1] num of objs [20]
page [2] num of objs [51]
page [3] num of objs [20]

 
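{code:java}
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.List;
import java.util.stream.Collectors;
import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;

// LOG is the enclosing class's logger (declaration not shown).
private static void printNumObjects(String pdfFilename) throws IOException {
    byte[] fileContent = Files.readAllBytes(new File(pdfFilename).toPath());
    try (PDDocument document = PDDocument.load(fileContent)) {
        // Split the source into one PDDocument per page.
        List<PDDocument> pages = new Splitter().split(document);

        // Serialize each single-page document to bytes, then close it.
        List<byte[]> pageBytes = pages.stream().map(page -> {
            try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
                page.save(baos);
                page.close();
                return baos.toByteArray();
            } catch (IOException e) {
                LOG.error("Failed to get bytes from page.", e);
                return new byte[0];
            }
        }).collect(Collectors.toList());

        // Count objects directly on the split documents (already saved and closed above).
        System.out.println("Page objects count from splitted pages:");
        for (int i = 0; i < pages.size(); i++) {
            int count = pages.get(i).getDocument().getObjects().size();
            System.out.println(String.format("page [%d] num of objs [%d]", i + 1, count));
        }

        // Count objects again after reloading each page from its saved bytes.
        System.out.println("Page objects count from pages generated from bytes:");
        for (int i = 0; i < pageBytes.size(); i++) {
            try (PDDocument reloaded = PDDocument.load(pageBytes.get(i))) {
                int count = reloaded.getDocument().getObjects().size();
                System.out.println(String.format("page [%d] num of objs [%d]", i + 1, count));
            } catch (IOException e) {
                LOG.error("Failed to load page.", e);
            }
        }
    }
}{code}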
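
One possible direction for counting without the save/reload round trip, offered only as a sketch: walk the in-memory object graph from the page's dictionary and count each distinct container once. The Java below models that idea with plain java.util maps and lists standing in for dictionaries and arrays. It does not call PDFBox, and the class name ReachableObjectCounter, the helper samplePage(), and the choice to skip the "Parent" key are all hypothetical. Whether such a count would match the figures obtained after saving and reloading is not established here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model: Map stands in for a dictionary, List for an array.
public class ReachableObjectCounter {

    /**
     * Counts distinct containers (maps and lists) reachable from root.
     * Objects are tracked by identity, so a shared object is counted once
     * and cyclic references terminate. Map entries whose key is in
     * skipKeys are not followed.
     */
    static int countReachable(Object root, Set<String> skipKeys) {
        Set<Object> seen = Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());
        Deque<Object> stack = new ArrayDeque<>();
        if (root != null) {
            stack.push(root);
        }
        while (!stack.isEmpty()) {
            Object current = stack.pop();
            boolean isContainer = current instanceof Map || current instanceof List;
            if (!isContainer || !seen.add(current)) {
                continue; // scalar value, or container already counted
            }
            if (current instanceof Map) {
                for (Map.Entry<?, ?> entry : ((Map<?, ?>) current).entrySet()) {
                    if (!skipKeys.contains(entry.getKey()) && entry.getValue() != null) {
                        stack.push(entry.getValue());
                    }
                }
            } else {
                for (Object item : (List<?>) current) {
                    if (item != null) {
                        stack.push(item);
                    }
                }
            }
        }
        return seen.size();
    }

    /** Builds a toy page graph with a shared font and a parent back-reference. */
    static Map<String, Object> samplePage() {
        Map<String, Object> font = new HashMap<>();
        font.put("Type", "Font");

        Map<String, Object> fonts = new HashMap<>();
        fonts.put("F1", font);
        fonts.put("F2", font); // same object referenced twice

        Map<String, Object> resources = new HashMap<>();
        resources.put("Font", fonts);

        Map<String, Object> stream1 = new HashMap<>();
        stream1.put("Length", 10);
        Map<String, Object> stream2 = new HashMap<>();
        stream2.put("Length", 20);
        List<Object> contents = new ArrayList<>();
        contents.add(stream1);
        contents.add(stream2);

        Map<String, Object> page = new HashMap<>();
        Map<String, Object> pages = new HashMap<>();
        List<Object> kids = new ArrayList<>();
        kids.add(page); // cycle: page -> Parent -> Kids -> page
        pages.put("Type", "Pages");
        pages.put("Kids", kids);

        page.put("Type", "Page");
        page.put("Parent", pages);
        page.put("Resources", resources);
        page.put("Contents", contents);
        return page;
    }

    public static void main(String[] args) {
        Map<String, Object> page = samplePage();
        // Skipping "Parent" keeps the walk inside the page's own subtree.
        System.out.println(countReachable(page, Collections.singleton("Parent"))); // prints 7
        System.out.println(countReachable(page, Collections.<String>emptySet()));  // prints 9
    }
}
```

Identity-based tracking matters in this sketch because the graph contains back-references (the page points to its parent, whose kids list points back to the page), so a set based on equals() and hashCode() could recurse indefinitely.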
 

 


> Pages do not have objects after splitting
> -----------------------------------------
>
>                 Key: PDFBOX-4738
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4738
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Yuguang Huang
>            Priority: Major


