Posted to users@pdfbox.apache.org by ya...@optonline.net on 2011/10/08 14:04:46 UTC

Problem trying to load 25 MB doc in PDFBox

I get a Java heap space OutOfMemoryError when trying to load a 25 MB document with Apache PDFBox. Smaller documents load without a problem. I have stripped the code down to just loading the document and printing the number of pages. I have also tried increasing the heap size. Here is the code:

import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;

public class LoadPDF {

    private static String pdfFilename = "My24MBFile.pdf";
    //private static String pdfFilename = "MyTinyFile.pdf";

    public void runLoadPDF(String inPDF_Filename) {
        PDDocument doc = null;
        try {
            System.out.println("Just BEFORE load Document");
            doc = PDDocument.load(inPDF_Filename);
            System.out.println("Just AFTER load Document");
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        System.out.println("number of pages is: " + doc.getNumberOfPages());
    }

    public static void main(String[] args) {
        LoadPDF readPDF = new LoadPDF();
        readPDF.runLoadPDF(pdfFilename);
    }
}

Here is the error from the system console in Eclipse:


Just BEFORE load Document
org.apache.pdfbox.exceptions.WrappedIOException
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:240)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1069)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1036)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:961)
    at LoadPDF.runLoadPDF(LoadPDF.java:13)
    at LoadPDF.main(LoadPDF.java:25)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151)
    at org.apache.pdfbox.io.RandomAccessBuffer.write(RandomAccessBuffer.java:131)
    at org.apache.pdfbox.io.RandomAccessFileOutputStream.write(RandomAccessFileOutputStream.java:108)
    at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
    at java.io.BufferedOutputStream.flush(Unknown Source)
    at java.io.FilterOutputStream.close(Unknown Source)
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:448)
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:552)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:184)
    ... 5 more
Exception in thread "main" java.lang.NullPointerException
    at LoadPDF.runLoadPDF(LoadPDF.java:19)
    at LoadPDF.main(LoadPDF.java:25)
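
A side note on the trace: the final NullPointerException looks like a follow-on failure rather than a second problem. The load threw, so doc was still null when getNumberOfPages() was called after the catch block. A minimal sketch of guarding against that is below. To keep it runnable on its own it does not use PDFBox at all: FakeDoc, load(...) and pageCountOrMinusOne(...) are hypothetical stand-ins invented for illustration, with load(...) playing the role of PDDocument.load(...).

```java
import java.io.IOException;

public class GuardedLoad {

    // Hypothetical stand-in for a loaded document; NOT a PDFBox type.
    static class FakeDoc {
        int getNumberOfPages() {
            return 3;
        }
    }

    // Hypothetical stand-in for PDDocument.load(...); fails on request.
    static FakeDoc load(String filename, boolean fail) throws IOException {
        if (fail) {
            throw new IOException("simulated load failure for " + filename);
        }
        return new FakeDoc();
    }

    // Uses the document only inside the try block, so a failed load can
    // never lead to a call on a null reference. Returns -1 on failure.
    static int pageCountOrMinusOne(String filename, boolean fail) {
        try {
            FakeDoc doc = load(filename, fail);
            return doc.getNumberOfPages();
        } catch (IOException e) {
            System.err.println("Could not load " + filename + ": " + e.getMessage());
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(pageCountOrMinusOne("small.pdf", false));
        System.out.println(pageCountOrMinusOne("large.pdf", true));
    }
}
```

This only tidies up the error reporting; it does nothing about the OutOfMemoryError itself.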

It seems to me that the whole document is being loaded into memory. If that is so, and I cannot even load a 25 MB document, then I am in real trouble, because we have much bigger documents to load (hundreds of MB). Does anyone know whether this is analogous to parsing XML documents with a DOM parser? If so, is there an equivalent of a SAX parser in PDFBox or in any other PDF library?
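
One thing worth ruling out with the heap size: in Eclipse, an -Xmx option only affects the program if it is entered under the run configuration's VM arguments. Settings in eclipse.ini apply to the IDE itself, not to the launched program. A small check such as the following, run from the same launch configuration, would show how much heap the JVM is actually allowed to use. The class name HeapCheck and the method maxHeapMegabytes() are made up for this sketch; Runtime.maxMemory() is the standard Java call.

```java
public class HeapCheck {

    // Maximum heap the JVM will attempt to use, in whole megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("max heap (MB): " + maxHeapMegabytes());
    }
}
```

If the printed value does not reflect the intended -Xmx setting, the option is probably not reaching the JVM that runs LoadPDF.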

Thanks in advance for any help or advice.

- Frank