Posted to dev@pdfbox.apache.org by "Nicholas DiPiazza (JIRA)" <ji...@apache.org> on 2017/07/05 20:09:00 UTC

[jira] [Commented] (PDFBOX-3856) Non-large PDF's can cause Out of Memory Exceptions

    [ https://issues.apache.org/jira/browse/PDFBOX-3856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075323#comment-16075323 ] 

Nicholas DiPiazza commented on PDFBOX-3856:
-------------------------------------------

Steps to reproduce: 

Create a Google Drive spreadsheet with a few hundred sheets, each with thousands of columns.
Export it to PDF.
Try to load the exported PDF into PDFBox and notice the failure.

> Non-large PDF's can cause Out of Memory Exceptions
> --------------------------------------------------
>
>                 Key: PDFBOX-3856
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3856
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.1
>            Reporter: Nicholas DiPiazza
>            Priority: Blocker
>         Attachments: Pasted image at 2017_07_05 02_26 PM.png
>
>
> Tika version: 1.13
> PDFBox Version: 2.0.1
> We are using an application that makes PDFs searchable using Apache Tika, which in turn uses PDFBox to parse each PDF and extract its body as text.
> We accept basically any PDF from any source, as long as it does not exceed our size limit (9 MB).
> However, we are noticing that some PDFs, even though they are not that large in file size, behave like zip bombs: they eat up all the heap space and crash the JVM.
> There is some sort of Object[] array that holds millions of {code}org.apache.pdfbox.text.TextPosition{code} instances.
> Here is a snapshot of the heapdump: https://issues.apache.org/jira/secure/attachment/12875808/Pasted%20image%20at%202017_07_05%2002_26%20PM.png
> Is there a setting to limit the size of this particular array so that it doesn't cause a memory bomb?
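To make the question above concrete, here is a minimal sketch of the kind of cap being asked about. It is plain Java and does not use PDFBox at all: the class name BoundedBuffer and the limit values are hypothetical, and whether PDFBox offers an equivalent setting is exactly what the issue is asking. The idea is only that a collector refuses to grow past a fixed number of items, so a runaway document fails fast instead of exhausting the heap.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper (not part of PDFBox or Tika): a buffer that refuses to
 * grow past a fixed cap and throws instead of consuming more heap.
 */
final class BoundedBuffer<T> {
    private final int maxItems;
    private final List<T> items = new ArrayList<>();

    BoundedBuffer(int maxItems) {
        if (maxItems <= 0) {
            throw new IllegalArgumentException("maxItems must be positive");
        }
        this.maxItems = maxItems;
    }

    /** Adds an item, or throws IllegalStateException once the cap is reached. */
    void add(T item) {
        if (items.size() >= maxItems) {
            throw new IllegalStateException("Item limit of " + maxItems + " exceeded");
        }
        items.add(item);
    }

    int size() {
        return items.size();
    }

    public static void main(String[] args) {
        // Assumed cap of 3 items, chosen only to keep the demonstration short.
        BoundedBuffer<String> buffer = new BoundedBuffer<>(3);
        try {
            for (int i = 0; i < 5; i++) {
                buffer.add("position-" + i);
            }
        } catch (IllegalStateException e) {
            // The caller can now skip the document instead of crashing the JVM.
            System.out.println("Aborted after " + buffer.size() + " items: " + e.getMessage());
        }
    }
}
```

In an application like the one described, the caller would catch the exception and mark the document as unparseable rather than letting the JVM run out of memory. Where such a check could be hooked into the text extraction path is left open here.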



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
