Posted to dev@pdfbox.apache.org by "Martijn Brinkers (JIRA)" <ji...@apache.org> on 2010/11/22 16:13:13 UTC

[jira] Commented: (PDFBOX-899) OutOfMemoryError with PDFTextStripper

    [ https://issues.apache.org/jira/browse/PDFBOX-899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934483#action_12934483 ] 

Martijn Brinkers commented on PDFBOX-899:
-----------------------------------------

I don't think the OOM is caused by a leak. The OOM happens because the PDF contains a large number of fonts and the font cache has no upper limit. I think the font cache should have a sane upper limit and stop caching fonts once it already holds the maximum number. I have added a patch to set an upper limit. I'm not sure what the best default upper limit should be, so I have used 100. The upper limit can be set using the system property -Dpdfontfactory=123.
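
To illustrate the idea of a capped cache, here is a minimal sketch. It is not the attached patch: the class name BoundedFontCache, its put/get methods and the property name example.fontcache.max are all made up for this example. Only the default of 100 and the "stop caching when full" behaviour come from the description above.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical bounded cache: once maxSize entries are stored, new keys are
 * simply not cached any more. Existing entries stay where they are.
 */
class BoundedFontCache<K, V> {
    // Assumed property name for this sketch only; not the one used by the patch.
    static final String LIMIT_PROPERTY = "example.fontcache.max";
    // Default taken from the comment above.
    static final int DEFAULT_LIMIT = 100;

    private final Map<K, V> cache = new HashMap<K, V>();
    private final int maxSize;

    /** Reads the limit from the system property, falling back to the default. */
    BoundedFontCache() {
        this(Integer.getInteger(LIMIT_PROPERTY, DEFAULT_LIMIT));
    }

    BoundedFontCache(int maxSize) {
        this.maxSize = maxSize;
    }

    /** Returns the cached value, or null if the key was never cached. */
    V get(K key) {
        return cache.get(key);
    }

    /** Returns true if the value was cached, false if the cache was full. */
    boolean put(K key, V value) {
        if (!cache.containsKey(key) && cache.size() >= maxSize) {
            return false; // full: the caller has to re-create the object next time
        }
        cache.put(key, value);
        return true;
    }

    int size() {
        return cache.size();
    }
}
```

With a limit of 2, the third distinct font is not stored and has to be parsed again whenever it is needed, which is the slowdown mentioned in the next paragraph.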

Because the fonts are only cached, I think the only downside of not caching is that parsing will be slower once the cache is full. Instead of setting a fixed upper limit, it might be nicer to use a cache that tracks which fonts were used most recently and removes the ones that are no longer used.
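
One way such a cache could look is a least-recently-used (LRU) map built on java.util.LinkedHashMap. This is only a sketch of that alternative, not something the patch does; the class name LruFontCache is made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical LRU cache: keeps at most maxSize entries and evicts the entry
 * that was accessed least recently when a new one is added.
 */
class LruFontCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    private final int maxSize;

    LruFontCache(int maxSize) {
        // accessOrder = true: iteration order runs from least to most recently accessed
        super(16, 0.75f, true);
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by LinkedHashMap after each insertion; returning true drops the eldest entry
        return size() > maxSize;
    }
}
```

Unlike a hard cap, this keeps caching new fonts after the limit is reached, at the cost of dropping fonts that have not been touched for the longest time. LinkedHashMap is not thread-safe, so shared use would need external synchronization.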

> OutOfMemoryError with PDFTextStripper
> -------------------------------------
>
>                 Key: PDFBOX-899
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-899
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.3.1
>         Environment: java version "1.6.0_22"
> Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
> Java HotSpot(TM) Client VM (build 17.1-b03, mixed mode)
>            Reporter: Alexander Veit
>            Priority: Critical
>         Attachments: PDFBOX-899.patch
>
>
> PDFBox 1.3.1 has high memory demands when stripping text from PDF files.
> http://www.unicode.org/Public/5.1.0/charts/CodeCharts.pdf even crashes an application server by requiring an estimated additional 300MB+ of heap memory. The heap dump suggests that PDFStreamEngine#documentFontCache might be the root of the leaking objects.
> PDFBox 1.0.0 did not show this behaviour. 
