Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2010/05/06 19:48:48 UTC

[jira] Resolved: (PDFBOX-695) COSStream doesn't actually stream tokens, causing OOM in larger PDF text extraction

     [ https://issues.apache.org/jira/browse/PDFBOX-695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-695.
---------------------------------------

    Fix Version/s: 1.2.0
       Resolution: Fixed

I've made a slight modification to the provided patch. Instead of using a COSStream, the PDFStreamEngine now uses the PDFStreamParser directly.

I've applied the patch in revision 941826.

Thanks to Kyle for the contribution.


> COSStream doesn't actually stream tokens, causing OOM in larger PDF text extraction
> -----------------------------------------------------------------------------------
>
>                 Key: PDFBOX-695
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-695
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.2.0
>         Environment: All
>            Reporter: Kyle Maxwell
>             Fix For: 1.2.0
>
>         Attachments: pdfbox-oom-against-935604.patch
>
>
> Text extraction of certain PDFs has been hanging and/or OOMing.  Profiling revealed that PDFStreamEngine.processSubStream() eventually calls PDFStreamParser.getTokens(), which assembles an ArrayList of all tokens in the stream.  In some cases, this list can use over 1 GB of memory.
> The attached patch replaces the call to PDFStreamParser.getTokens() with PDFStreamParser.getTokensIterator(), which streams the tokens one at a time and avoids building the ArrayList.  The iterator is only used in the call path of org.apache.pdfbox.ExtractText, so the fix may not benefit other usages.  Also, the API used by the fix may not be ideal.
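
For readers unfamiliar with the difference the report describes, here is a minimal sketch of eager versus lazy tokenization. It is not the PDFBox code or the attached patch: SketchTokenParser, allTokens() and tokenIterator() are hypothetical names made up for illustration, and the tokenizer simply splits on whitespace instead of parsing real PDF content-stream syntax.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a content-stream parser; NOT the PDFBox API.
class SketchTokenParser {
    private final Reader in;

    SketchTokenParser(Reader in) {
        this.in = in;
    }

    // Reads the next whitespace-delimited token, or returns null at end of input.
    private String readToken() {
        try {
            int c = in.read();
            while (c != -1 && Character.isWhitespace(c)) {
                c = in.read();
            }
            if (c == -1) {
                return null;
            }
            StringBuilder sb = new StringBuilder();
            while (c != -1 && !Character.isWhitespace(c)) {
                sb.append((char) c);
                c = in.read();
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Eager variant: every token is held in memory before the caller sees any.
    List<String> allTokens() {
        List<String> tokens = new ArrayList<>();
        for (String t = readToken(); t != null; t = readToken()) {
            tokens.add(t);
        }
        return tokens;
    }

    // Lazy variant: at most one token is buffered at a time.
    Iterator<String> tokenIterator() {
        return new Iterator<String>() {
            private String pending;
            private boolean fetched = false;

            @Override
            public boolean hasNext() {
                if (!fetched) {
                    pending = readToken();
                    fetched = true;
                }
                return pending != null;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                fetched = false;
                String t = pending;
                pending = null;
                return t;
            }
        };
    }
}

class TokenStreamingSketch {
    public static void main(String[] args) {
        Reader input = new StringReader("BT /F1 12 Tf (Hello) Tj ET");
        Iterator<String> it = new SketchTokenParser(input).tokenIterator();
        int count = 0;
        while (it.hasNext()) {
            it.next(); // a real consumer would process the token here
            count++;
        }
        System.out.println("tokens seen: " + count);
    }
}
```

With the iterator, memory use stays roughly constant no matter how long the stream is, because each token becomes garbage as soon as the consumer is done with it. The trade-off is that the caller can no longer index into or re-scan the token list.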
