Posted to dev@pdfbox.apache.org by "Michael Goddard (JIRA)" <ji...@apache.org> on 2014/10/22 11:57:34 UTC
[jira] [Commented] (PDFBOX-2445) Out of Memory - Extract text for Apache_Solr_4.7_Ref_Guide.pdf

    [ https://issues.apache.org/jira/browse/PDFBOX-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179774#comment-14179774 ] 

Michael Goddard commented on PDFBOX-2445:
-----------------------------------------

On a project using Apache Tika 1.6, we hit this issue with a particular PDF file. To confirm, I tried extracting the text with PDFBox alone, as shown below, and observed the same error. Adobe Acrobat Reader is able to extract the text from this document. Are there any thoughts on how to solve this, other than avoiding the problematic PDF files?

Here's what I observed:

[Downloads]$ java -Xmx1g -jar pdfbox-app-1.8.7.jar ExtractText -console -encoding UTF-8 ./Apache_Solr_4.7_Ref_Guide.pdf
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.AbstractCollection.toArray(AbstractCollection.java:136)
    at java.util.ArrayList.<init>(ArrayList.java:168)
    at org.apache.pdfbox.cos.COSDocument.getObjects(COSDocument.java:534)
    at org.apache.pdfbox.cos.COSDocument.close(COSDocument.java:591)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:258)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1233)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1198)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1123)
    at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:212)
    at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
    at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
Oct 22, 2014 5:04:13 AM org.apache.pdfbox.cos.COSDocument finalize
WARNING: Warning: You did not close a PDF Document
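
As a possible stopgap while this is open (a rough sketch only, not a fix for the parser), the same command line could be launched from Java in a separate JVM, so that an OutOfMemoryError ends the child process instead of the calling application. The class name IsolatedExtractText, its method names, and the timeout value are hypothetical; the command-line arguments are the ones from the invocation above. It assumes Java 8 or later and a "java" executable on the PATH.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs the PDFBox ExtractText tool in a child JVM.
public class IsolatedExtractText {

    // Builds the same command line as the shell invocation above.
    static List<String> buildCommand(String maxHeap, String appJar, String pdfPath) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Xmx" + maxHeap);   // heap limit applies to the child JVM only
        cmd.add("-jar");
        cmd.add(appJar);
        cmd.add("ExtractText");
        cmd.add("-console");
        cmd.add("-encoding");
        cmd.add("UTF-8");
        cmd.add(pdfPath);
        return cmd;
    }

    // Returns true only if the child JVM exits with status 0 within the timeout.
    // An uncaught OutOfMemoryError in the child gives a non-zero exit status.
    static boolean extract(List<String> cmd, File textOut, long timeoutSeconds)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.redirectOutput(textOut);                       // extracted text goes to a file
        pb.redirectError(ProcessBuilder.Redirect.INHERIT); // stack traces stay visible
        Process child = pb.start();
        if (!child.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            child.destroyForcibly();                      // give up on a hung extraction
            return false;
        }
        return child.exitValue() == 0;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.err.println("usage: IsolatedExtractText <pdfbox-app.jar> <input.pdf> <output.txt>");
            return;
        }
        List<String> cmd = buildCommand("1g", args[0], args[1]);
        boolean ok = extract(cmd, new File(args[2]), 120);
        System.out.println(ok ? "extracted" : "extraction failed or timed out");
    }
}
```

With this document, I would expect the call to report a failure rather than produce text, since the child JVM hits the same heap limit; the point is only that the caller survives and can skip or flag the file.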

I couldn't find the "upload" button here, so I uploaded the PDF to S3, along with the text produced by Adobe Acrobat Reader:

https://s3.amazonaws.com/goddard.public/Apache_Solr_4.7_Ref_Guide.pdf
https://s3.amazonaws.com/goddard.public/Apache_Solr_4.7_Ref_Guide.txt

Also, I tried enabling the non-sequential PDFBox parser from my Tika-based code, but this didn't solve the problem:

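// Ask Tika's PDF parser to use PDFBox's non-sequential parser
PDFParserConfig pdfParserConfig = new PDFParserConfig();
pdfParserConfig.setUseNonSequentialParser(true);
// context is the Tika ParseContext that is passed to the parser
context.set(PDFParserConfig.class, pdfParserConfig);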


> Out of Memory - Extract text for Apache_Solr_4.7_Ref_Guide.pdf
> --------------------------------------------------------------
>
>                 Key: PDFBOX-2445
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2445
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.7, 2.0.0
>            Reporter: Maruan Sahyoun
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)