Posted to dev@pdfbox.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2014/11/12 20:11:36 UTC

[jira] [Commented] (PDFBOX-2200) Memory leak with org.apache.pdfbox.pdmodel.font.PDFont#cmapObjects

    [ https://issues.apache.org/jira/browse/PDFBOX-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208483#comment-14208483 ] 

Tim Allison commented on PDFBOX-2200:
-------------------------------------

[~alanbur] recently pointed out on TIKA-1471 that calling clearResources() in a multithreaded environment is a bad idea. Would it make sense (shudder) to make cmapObjects a ThreadLocal? Or is there another recommendation for what we should do until 2.0 is released if we're running PDFBox in multiple threads in a long-running process?
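
To make the ThreadLocal question concrete, here is a minimal sketch of what a per-thread cache could look like. It is not PDFBox code: the class name PerThreadCmapCache and its methods are hypothetical, the cached values are plain Objects standing in for CMap instances, and ThreadLocal.withInitial assumes Java 8 or later.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a per-thread CMap cache; this is NOT PDFBox's API.
final class PerThreadCmapCache {

    // Each thread lazily receives its own map, so clearing the cache on one
    // thread cannot remove entries that another thread is still using.
    private static final ThreadLocal<Map<String, Object>> CACHE =
            ThreadLocal.withInitial(HashMap::new);

    private PerThreadCmapCache() {
    }

    // Returns the cached object for this thread, or null if absent.
    static Object get(String name) {
        return CACHE.get().get(name);
    }

    // Stores an object in the calling thread's cache only.
    static void put(String name, Object cmap) {
        CACHE.get().put(name, cmap);
    }

    // Number of entries held by the calling thread's cache.
    static int size() {
        return CACHE.get().size();
    }

    // Intended to be called after each document, on the thread that parsed it.
    static void clear() {
        CACHE.get().clear();
    }
}
```

The trade-off in this sketch is that every thread keeps its own copy of each cached entry, so memory use scales with the number of threads, and in a thread pool the entries live until that thread calls clear().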

> Memory leak with org.apache.pdfbox.pdmodel.font.PDFont#cmapObjects
> ------------------------------------------------------------------
>
>                 Key: PDFBOX-2200
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2200
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 1.8.6, 2.0.0
>            Reporter: Matthew Buckett
>             Fix For: 2.0.0
>
>
> We use Tika to extract text from a large number (10,000+) of PDFs in a long-running JVM. After doing this for a while, we started running short of heap space. A heap dump shows that about 717MB of heap is retained through org.apache.pdfbox.pdmodel.font.PDFont#cmapObjects and the hashmap has 18001 entries.
> PDFBOX-1009 looked to partially address this, but it appears the symptoms are still present. As a workaround I'm going to manually call PDFont.clearResources() after indexing each document to prevent this happening, but it would be better if I didn't have to.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)