Posted to dev@pdfbox.apache.org by "Anca Zapuc (Created) (JIRA)" <ji...@apache.org> on 2012/02/09 23:02:57 UTC

[jira] [Created] (PDFBOX-1226) Counting pages of a PDF gives OutOfMemoryError

Counting pages of a PDF gives OutOfMemoryError
----------------------------------------------

                 Key: PDFBOX-1226
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1226
             Project: PDFBox
          Issue Type: Bug
          Components: PDFReader
    Affects Versions: 1.6.0
         Environment: Windows 7 / Windows XP
            Reporter: Anca Zapuc


I have a PDF (397 MB) and I am trying to count its pages.
I am able to open the PDF with Adobe Reader 9, but not with Foxit Reader.
Code:
    // 'file' refers to the input PDF and is declared elsewhere
    PDDocument doc = null;
    File temp = null;
    RandomAccessFile rand = null;
    int nr = 0;
    try {
        // temporary scratch file that PDFBox needs when dealing with very large PDFs
        temp = new File("e:/temp.tmp");
        // random access file on top of the scratch file, needed for very large PDFs
        rand = new RandomAccessFile(temp, "rw");
        doc = PDDocument.load(file, rand);
        nr = doc.getNumberOfPages();
    } catch (Exception e) {
        e.printStackTrace();
    }

Got the following exception:
org.apache.pdfbox.exceptions.WrappedIOException
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:240)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1069)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1022)
	at PDFBoxExample.getHugeNrOfFiles(PDFBoxExample.java:36)
	at PDFBoxExample.main(PDFBoxExample.java:258)
Caused by: java.lang.OutOfMemoryError: Java heap space
	at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
	at java.lang.StringBuffer.<init>(StringBuffer.java:79)
	at org.apache.pdfbox.pdfparser.BaseParser.readString(BaseParser.java:1121)
	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:402)
	at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:552)
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:184)
	... 4 more
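
The trace above shows the OutOfMemoryError arriving wrapped inside another exception, which is why a plain catch (Exception e) sees it at all. As a rough sketch only, a caller could walk the cause chain to tell a heap problem apart from an ordinary I/O failure. The class and method names below are hypothetical and not part of PDFBox.

```java
// Hypothetical helper, not part of PDFBox: checks whether the cause chain
// of a Throwable contains an OutOfMemoryError.
final class CauseChain {
    private CauseChain() {}

    static boolean causedByOutOfMemory(Throwable t) {
        // Bound the walk so that a cyclic cause chain cannot loop forever.
        for (int depth = 0; t != null && depth < 64; depth++) {
            if (t instanceof OutOfMemoryError) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }
}
```

In the catch block of the snippet above, a call such as CauseChain.causedByOutOfMemory(e) could then decide whether to report "not enough heap" instead of a generic parse error.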

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Closed] (PDFBOX-1226) Counting pages of a PDF gives OutOfMemoryError

Posted by "Timo Boehme (Closed) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timo Boehme closed PDFBOX-1226.
-------------------------------

       Resolution: Not A Problem
    Fix Version/s: 1.6.0

I am closing this issue as 'Not A Problem' because this is simply a case of a complex file that needs enough heap space to be parsed. It is therefore not a bug but a limitation of the current parser, which may be lifted once PDFBOX-1000 is ready.
                

[jira] [Issue Comment Edited] (PDFBOX-1226) Counting pages of a PDF gives OutOfMemoryError

Posted by "Adam Nichols (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205123#comment-13205123 ] 

Adam Nichols edited comment on PDFBOX-1226 at 2/10/12 12:53 AM:
----------------------------------------------------------------

There are a few different options here. The easiest and fastest would be to increase the amount of memory available to your JVM. If you need code that works straight away, you should go this route.
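
For reference, the heap ceiling is raised with the standard -Xmx option of the java launcher. The following is a minimal sketch, assuming you want to verify at startup that the JVM really got the heap you asked for. The class name and the 1024 MiB default are placeholders chosen for illustration, not values taken from this issue.

```java
// Hypothetical helper: reports the JVM's maximum heap and compares it
// against a required amount given in MiB.
final class HeapCheck {
    private HeapCheck() {}

    // Maximum heap the JVM will attempt to use, in MiB.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    static boolean hasAtLeastMiB(long requiredMiB) {
        return maxHeapMiB() >= requiredMiB;
    }

    public static void main(String[] args) {
        // Placeholder default; pass the real requirement as the first argument.
        long required = args.length > 0 ? Long.parseLong(args[0]) : 1024L;
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
        if (!hasAtLeastMiB(required)) {
            System.out.println("Heap looks too small; consider a larger -Xmx value.");
        }
    }
}
```

It could be started as, for example, java -Xmx1g HeapCheck 512. Note that the value reported by maxMemory() can be somewhat lower than the -Xmx setting, depending on the collector in use.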

However, counting pages shouldn't require a lot of memory (nor even reading the entire file, for that matter).  PDFBOX-1000 is tracking a new parser which isn't done yet, but it might be far enough along to count pages (it's been a while since I had the code open).  That parser will probably take quite a while to complete, so there's also PDFBOX-1199, which is a shorter-term solution.  Sorry I can't dig into the code right now and give you a more certain answer.  Hopefully the references to the other JIRA issues will be enough to help you out.
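
To illustrate why a page count need not cost much memory, here is a heuristic sketch that is not PDFBox code and not a PDF parser. It streams the file once in constant memory and returns the largest integer that follows a /Count key. In a PDF the root node of the page tree carries the total number of pages in its /Count entry, so on simple files the largest value is often the page count. The assumptions are loud ones: the page tree must not be stored in compressed object streams, outline dictionaries (which also use /Count) must not hold a larger value, and the value must be a direct integer. All names are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

// Heuristic only: scans raw bytes for "/Count <integer>" and keeps the maximum.
final class PageCountGuess {
    private static final byte[] KEY = {'/', 'C', 'o', 'u', 'n', 't'};

    private PageCountGuess() {}

    // The six whitespace characters defined by the PDF syntax.
    private static boolean isPdfWhitespace(int b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    // Returns the largest integer found after a /Count key, or -1 if none.
    static long largestCount(InputStream in) throws IOException {
        InputStream buffered = new BufferedInputStream(in);
        long best = -1;           // largest value seen so far
        int matched = 0;          // number of KEY bytes matched so far
        boolean sawSpace = false; // whitespace seen between key and number
        long value = -1;          // number being read, -1 while no digit seen
        int b;
        while ((b = buffered.read()) != -1) {
            if (matched < KEY.length) {
                // '/' occurs only at the start of KEY, so restarting on '/'
                // is the only fallback needed after a mismatch.
                matched = (b == KEY[matched]) ? matched + 1 : (b == '/' ? 1 : 0);
            } else if (value < 0 && isPdfWhitespace(b)) {
                sawSpace = true;
            } else if (sawSpace && b >= '0' && b <= '9'
                    && value < Long.MAX_VALUE / 10 - 1) {
                value = Math.max(value, 0) * 10 + (b - '0');
            } else {
                // End of the number, or the key was only a prefix of a longer name.
                best = Math.max(best, value);
                value = -1;
                sawSpace = false;
                matched = (b == '/') ? 1 : 0;
            }
        }
        return Math.max(best, value);
    }
}
```

Passing a java.io.FileInputStream for the PDF gives a quick guess without building any object model. The result should be treated as a hint to cross-check, not as a replacement for getNumberOfPages().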

Also, if you're not using the getNumberOfPages() method, take a look at PDFBOX-911 for some very simple sample code.
                

[jira] [Updated] (PDFBOX-1226) Counting pages of a PDF gives OutOfMemoryError

Posted by "Anca Zapuc (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anca Zapuc updated PDFBOX-1226:
-------------------------------


Can you tell me where I can download PDFBox 1.7.0?
                

[jira] [Assigned] (PDFBOX-1226) Counting pages of a PDF gives OutOfMemoryError

Posted by "Timo Boehme (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timo Boehme reassigned PDFBOX-1226:
-----------------------------------

    Assignee: Timo Boehme
    

[jira] [Updated] (PDFBOX-1226) Counting pages of a PDF gives OutOfMemoryError

Posted by "Anca Zapuc (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anca Zapuc updated PDFBOX-1226:
-------------------------------

    Attachment: Big_no_pages.7z

The PDF is attached.
                

[jira] [Updated] (PDFBOX-1226) Counting pages of a PDF gives OutOfMemoryError

Posted by "Timo Boehme (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timo Boehme updated PDFBOX-1226:
--------------------------------

    Component/s: Parsing
       Priority: Minor  (was: Major)
    

[jira] [Commented] (PDFBOX-1226) Counting pages of a PDF gives OutOfMemoryError

Posted by "Timo Boehme (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205594#comment-13205594 ] 

Timo Boehme commented on PDFBOX-1226:
-------------------------------------

The file is quite special in that it contains nearly 2 million objects and 925354 pages. With the current parsers, the file is read completely before the API can be used to get the page count. This is also true for PDFBOX-1199 because, for compatibility with the existing code base, it has to parse all objects (however, there is a 'parseMinimalCatalog' mode which can be used in this case to parse only a smaller number of objects). So far PDFBOX-1199 has not landed because encryption is currently not supported. You would have to build your own library using SVN and the files in PDFBOX-1199.

For the time being, parsing the sample file requires approx. 2 GB of heap space (tested on my machine, where it took 153 seconds to parse and return the page count). With less memory, GC will take most of the processing time.
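
To see on your own machine how much of the run time goes to GC, something along the following lines might help. It is only a sketch using the standard java.lang.management beans. The class and method names are made up for illustration, and the task passed in stands for whatever code loads the document and reads the page count.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: measures wall-clock time and accumulated GC time of a task.
final class GcShare {
    private GcShare() {}

    // Cumulative GC time in milliseconds, summed over all collectors.
    // Collectors that report -1 (undefined) are skipped.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Runs the task and returns { wall-clock millis, GC millis spent meanwhile }.
    static long[] measure(Runnable task) {
        long gcBefore = totalGcMillis();
        long start = System.nanoTime();
        task.run();
        long wallMillis = (System.nanoTime() - start) / 1_000_000L;
        return new long[] {wallMillis, totalGcMillis() - gcBefore};
    }
}
```

If the second number comes close to the first, the heap is too small for the file and raising -Xmx is likely to help more than anything else. The GC figure covers the whole JVM, so other threads allocating at the same time will distort it.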
                

[jira] [Updated] (PDFBOX-1226) Counting pages of a PDF gives OutOfMemoryError

Posted by "Anca Zapuc (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anca Zapuc updated PDFBOX-1226:
-------------------------------

    Description: 
I have a PDF (397 MB) and I am trying to count its pages.
I am able to open the PDF with Adobe Reader 9, but not with Foxit Reader.
Code:
    // 'file' refers to the input PDF and is declared elsewhere
    PDDocument doc = null;
    File temp = null;
    RandomAccessFile rand = null;
    int nr = 0;
    try {
        // temporary scratch file that PDFBox needs when dealing with very large PDFs
        temp = new File("e:/temp.tmp");
        // random access file on top of the scratch file, needed for very large PDFs
        rand = new RandomAccessFile(temp, "rw");
        doc = PDDocument.load(file, rand);
        nr = doc.getNumberOfPages();
    } catch (Exception e) {
        e.printStackTrace();
    }

Got the following exception:
org.apache.pdfbox.exceptions.WrappedIOException
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:240)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1069)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1022)
	at PDFBoxExample.getHugeNrOfFiles(PDFBoxExample.java:36)
	at PDFBoxExample.main(PDFBoxExample.java:258)
Caused by: java.lang.OutOfMemoryError: Java heap space
	at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
	at java.lang.StringBuffer.<init>(StringBuffer.java:79)
	at org.apache.pdfbox.pdfparser.BaseParser.readString(BaseParser.java:1121)
	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:402)
	at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:552)
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:184)
	... 4 more

I attached the PDF.


[jira] [Commented] (PDFBOX-1226) Counting pages of a PDF gives OutOfMemoryError

Posted by "Adam Nichols (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205123#comment-13205123 ] 

Adam Nichols commented on PDFBOX-1226:
--------------------------------------

There are a few different options here. The easiest and fastest would be to increase the amount of memory available to your JVM. If you need code that works straight away, you should go this route.

However, counting pages shouldn't require a lot of memory (nor even reading the entire file, for that matter).  PDFBOX-1000 is tracking a new parser which isn't done yet, but it might be far enough along to count pages (it's been a while since I had the code open).  That parser will probably take quite a while to complete, so there's also PDFBOX-1199, which is a shorter-term solution.  Sorry I can't dig into the code right now and give you a more certain answer.  Hopefully the references to the other JIRA issues will be enough to help you out.

Also, if you're not using the getNumberOfPages() method, take a look at PDFBOX-911 for some very simple sample code.
                