Posted to dev@pdfbox.apache.org by "Michael Goddard (JIRA)" <ji...@apache.org> on 2014/10/22 11:24:34 UTC

[jira] [Commented] (PDFBOX-1907) Out of memory - COSDocument (RandomAccessBuffer)

    [ https://issues.apache.org/jira/browse/PDFBOX-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14179752#comment-14179752 ] 

Michael Goddard commented on PDFBOX-1907:
-----------------------------------------

On a project using Apache Tika 1.6, we hit this issue with a particular PDF file.  To confirm, I tried extracting the text with PDFBox alone (pdfbox-app 1.8.7 with a 1 GB heap), as shown below, and observed the same error.  Adobe Acrobat Reader is able to extract the text from this document.  Is there any way to work around this other than avoiding the problematic PDF files?

Here's what I observed:

[Downloads]$ java -Xmx1g -jar pdfbox-app-1.8.7.jar ExtractText -console -encoding UTF-8 ./Apache_Solr_4.7_Ref_Guide.pdf 
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
  at java.util.AbstractCollection.toArray(AbstractCollection.java:136)
  at java.util.ArrayList.<init>(ArrayList.java:168)
  at org.apache.pdfbox.cos.COSDocument.getObjects(COSDocument.java:534)
  at org.apache.pdfbox.cos.COSDocument.close(COSDocument.java:591)
  at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:258)
  at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1233)
  at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1198)
  at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1123)
  at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:212)
  at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
  at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
Oct 22, 2014 5:04:13 AM org.apache.pdfbox.cos.COSDocument finalize
WARNING: Warning: You did not close a PDF Document
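
For readers of the trace: the top three frames show the error being raised while COSDocument.getObjects builds a new ArrayList from an existing collection, which requires a second array as large as that collection. The following is a minimal, stdlib-only sketch of that allocation pattern. ObjectPool and its methods are hypothetical stand-ins invented for illustration; this is not PDFBox code and not a proposed patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a document's object pool; not PDFBox code.
class ObjectPool {
    private final Map<Integer, String> pool = new LinkedHashMap<Integer, String>();

    void put(int key, String value) {
        pool.put(key, value);
    }

    // Copying pattern: the ArrayList constructor calls toArray() on the
    // collection, so a second array as large as the pool is allocated.
    List<String> snapshot() {
        return new ArrayList<String>(pool.values());
    }

    // Non-copying alternative: visit the values in place, no extra array.
    int countNonEmpty() {
        int count = 0;
        for (String value : pool.values()) {
            if (!value.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        ObjectPool objects = new ObjectPool();
        objects.put(1, "a");
        objects.put(2, "");
        System.out.println(objects.snapshot().size());
        System.out.println(objects.countNonEmpty());
    }
}
```

The copy is harmless for small pools. It only matters when the pool is already large enough that the heap cannot hold a second array of the same length, which is consistent with where the trace above fails.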


> Out of memory - COSDocument (RandomAccessBuffer)
> ------------------------------------------------
>
>                 Key: PDFBOX-1907
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1907
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.4
>         Environment: windows xp 64
> jdk 8 32 bit
>            Reporter: Jim Kay
>            Assignee: Andreas Lehmkühler
>              Labels: regression
>         Attachments: 8283.zip.001, 8283.zip.002, 8283.zip.003
>
>
> Possibly related to PDFBOX-1777.
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> 	at java.util.AbstractCollection.toArray(AbstractCollection.java:136)
> 	at java.util.ArrayList.<init>(ArrayList.java:168)
> 	at org.apache.pdfbox.cos.COSDocument.getObjects(COSDocument.java:518)
> 	at org.apache.pdfbox.cos.COSDocument.getObjects(COSDocument.java:518)
> 	at org.apache.pdfbox.cos.COSDocument.close(COSDocument.java:575)
> 	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:254)
> 	at techref.Testpdfbox.main(Testpdfbox.java:36)
> The heap space is set to -Xmx1640m.
> The PDF document is parsed OK with version 1.8.3 but fails with 1.8.4.
> The large pdf document has the following attributes.
> pdDoc.getCurrentAccessPermission.canExtractContent = true
> pdDoc.getCurrentAccessPermission.canExtractForAccessibility = true
> pdDoc.getNumberOfPages = 228
> pdDoc.getDocumentCatalog.getLanguage = null
> pdDoc.getDocumentCatalog.getPageLayout = SinglePage
> pdDoc.getDocumentCatalog.getPageMode = UseNone
> pdDoc.getDocumentCatalog.getVersion = null
> Page Count=228
> Title=Microsoft Word - FEA.doc
> Author=null
> Subject=null
> Keywords=null
> Creator=Windows NT 4.0
> Producer=Acrobat Distiller 4.05 for Windows
> Creation Date=Fri Jun 29 15:29:59 BST 2001
> Modification Date=Mon Jul 02 15:41:18 BST 2001
> Trapped=null
> Dictionary=COSDictionary{(COSName{CreationDate}:COSString{D:20010629142959}) (COSName{Producer}:COSString{Acrobat Distiller 4.05 for Windows}) (COSName{Creator}:COSString{Windows NT 4.0}) (COSName{Title}:COSString{Microsoft Word - FEA.doc}) (COSName{ModDate}:COSString{D:20010702164118+02'00'}) }


