Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2016/03/22 18:41:25 UTC
[jira] [Commented] (PDFBOX-3284) Big Pdf parsing to text - Out of memory

    [ https://issues.apache.org/jira/browse/PDFBOX-3284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206862#comment-15206862 ] 

Tilman Hausherr commented on PDFBOX-3284:
-----------------------------------------

PDFBox doesn't parse on demand, so many structures that aren't needed are still parsed, expanded, and held in memory, e.g. the many annotations (links) in this file. We've had complaints about higher memory usage than yours :-) Btw, you can save a little memory by loading from a File instead of a stream, and by using a scratch file.
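
A minimal sketch of that last suggestion, assuming the PDFBox 2.0.x API (PDDocument.load(File, MemoryUsageSetting) and MemoryUsageSetting.setupTempFileOnly()); the class and method names are only illustrative, and the actual savings depend on the file:

```java
import java.io.File;
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper class, not part of PDFBox.
public class ScratchFileTextExtraction {

    // Writes the text of pdfFile to textFile as UTF-8.
    public static void extractText(File pdfFile, File textFile) throws IOException {
        // Load from a File rather than an InputStream, and let PDFBox buffer
        // its working data in a temporary scratch file instead of on the heap.
        // try-with-resources closes both the document and the writer.
        try (PDDocument document = PDDocument.load(pdfFile, MemoryUsageSetting.setupTempFileOnly());
             Writer writer = Files.newBufferedWriter(textFile.toPath(), StandardCharsets.UTF_8)) {
            new PDFTextStripper().writeText(document, writer);
        }
    }

    public static void main(String[] args) throws IOException {
        extractText(new File(args[0]), new File(args[1]));
    }
}
```

This only moves part of the working data to disk. The parsed object structures described above still live on the heap, so a larger -Xmx may still be needed for a file like this one.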

> Big Pdf parsing to text - Out of memory
> ---------------------------------------
>
>                 Key: PDFBOX-3284
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3284
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.10, 1.8.11, 2.0.0, 2.1.0
>            Reporter: Nicolas Daniels
>
> I'm trying to parse a quite big PDF (26MB) and transform it to text, however I'm facing a huge memory consumption leading to out of memory error. Running my test with -Xmx768M will always fail. I've to increase to 1500M to make it work. 
> The resulting text is only 3MB so I don't understand why it is taking so much memory.
> I've tested this code over 1.8.10, 1.8.11 & 2.0.0 with same result.
> The pdf can be found [here|https://www2.swift.com/uhbonline/books/public/en_uk/clr_3_0_stdsmx_msg_def_rpt_sch/sr2015_mx_clearing_3dot0_mdr2_solution.pdf]
> My code:
> {code:title=Test.java|borderStyle=solid}
> @Test
> public void testParsePdf_Content_Memory() throws Exception {
>     InputStream inputStream = getClass().getResourceAsStream("c:/tmp/sr2015_mx_clearing_3dot0_mdr2_solution.pdf");
>     try {
>         StringWriter writer = new StringWriter();
>         FileWriter fileWriter = new FileWriter(new File("c:/tmp/test.txt"));
>         PDFTextStripper pdfTextStripper = new PDFTextStripper();
>         pdfTextStripper.writeText(PDDocument.load(inputStream), fileWriter);
>         fileWriter.close();
>     } finally {
>         inputStream.close();
>     }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org