Posted to dev@pdfbox.apache.org by "Alexander Veit (JIRA)" <ji...@apache.org> on 2010/11/22 13:45:16 UTC

[jira] Created: (PDFBOX-899) OutOfMemoryError with PDFTextStripper

OutOfMemoryError with PDFTextStripper
-------------------------------------

                 Key: PDFBOX-899
                 URL: https://issues.apache.org/jira/browse/PDFBOX-899
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.3.1
         Environment: java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
Java HotSpot(TM) Client VM (build 17.1-b03, mixed mode)
            Reporter: Alexander Veit
            Priority: Critical


PDFBox 1.3.1 has high memory demands when stripping text from PDF files.

http://www.unicode.org/Public/5.1.0/charts/CodeCharts.pdf even crashes an application server by requiring an estimated additional 300 MB+ of heap memory. The heap dump suggests that PDFStreamEngine#documentFontCache might be the root of the leaking objects.

PDFBox 1.0.0 did not show this behaviour. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PDFBOX-899) OutOfMemoryError with PDFTextStripper

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987581#action_12987581 ] 

Andreas Lehmkühler commented on PDFBOX-899:
-------------------------------------------

The extraction works fine with the current trunk version (rev. 1063402) without applying the patch. Probably my recent changes to the font stuff accidentally eliminated a memory leak and/or reduced the memory consumption. Can you confirm that behaviour?

P.S.: According to the document properties it isn't allowed to extract the text .... ;-)


[jira] Commented: (PDFBOX-899) OutOfMemoryError with PDFTextStripper

Posted by "Martijn Brinkers (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934483#action_12934483 ] 

Martijn Brinkers commented on PDFBOX-899:
-----------------------------------------

I don't think the OOM is caused by a leak. The OOM happens because the PDF contains a large number of fonts and the font cache has no upper limit. I think the font cache should have a sane upper limit and stop caching fonts once it already contains the maximum number of fonts. I have added a patch to set an upper limit. I'm not sure what the best default upper limit should be, so I have used 100. The upper limit can be set using the system property -Dpdfontfactory=123.
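
For illustration, the "stop caching when full" idea could look roughly like the sketch below. This is not the attached patch: the class BoundedCache and the property name passed to fromSystemProperty are hypothetical, and the code only assumes that the limit is an integer read from a system property with a fallback default.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of a cache with a hard upper limit.
 * Once the limit is reached, new keys are simply not cached.
 */
class BoundedCache<K, V> {
    private final int maxEntries;
    private final Map<K, V> entries = new HashMap<K, V>();

    BoundedCache(int maxEntries) {
        if (maxEntries < 0) {
            throw new IllegalArgumentException("maxEntries must not be negative");
        }
        this.maxEntries = maxEntries;
    }

    /**
     * Reads the limit from an integer system property (the name is an
     * assumption supplied by the caller) and falls back to defaultMax.
     */
    static <K, V> BoundedCache<K, V> fromSystemProperty(String propertyName, int defaultMax) {
        return new BoundedCache<K, V>(Integer.getInteger(propertyName, defaultMax));
    }

    /** Returns the cached value, or null if the key was never cached. */
    V get(K key) {
        return entries.get(key);
    }

    /**
     * Caches the value if the key is already present or there is room left.
     * Returns false when the cache is full and the value was not stored.
     */
    boolean put(K key, V value) {
        if (entries.containsKey(key) || entries.size() < maxEntries) {
            entries.put(key, value);
            return true;
        }
        return false;
    }

    int size() {
        return entries.size();
    }
}
```

A caller that gets false from put() just keeps using the freshly parsed object without caching it, so the only cost of a full cache is repeated parsing.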

Because the fonts are only cached, I think the only downside of not caching is that parsing will be slower once the cache is full. Instead of setting a hard upper limit, it might be nicer to use a cache that tracks which fonts were used most recently and removes the ones that are no longer used.
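
A minimal sketch of such a "remove what is no longer used" cache, assuming a least-recently-used policy is acceptable, can be built on java.util.LinkedHashMap in access order. The class name LruCache is hypothetical and nothing here is taken from the PDFBox sources or from the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical least-recently-used cache: when the limit is exceeded,
 * the entry that was accessed longest ago is evicted.
 */
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    private final int maxEntries;

    LruCache(int maxEntries) {
        // The third argument 'true' selects access order instead of insertion order.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by LinkedHashMap after each insertion; returning true drops the eldest entry.
        return size() > maxEntries;
    }
}
```

Note that LinkedHashMap is not thread-safe, so a cache shared between threads would need external synchronization, for example via Collections.synchronizedMap.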


[jira] Updated: (PDFBOX-899) OutOfMemoryError with PDFTextStripper

Posted by "Martijn Brinkers (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Martijn Brinkers updated PDFBOX-899:
------------------------------------

    Attachment: PDFBOX-899.patch

Patch to set an upper limit on the number of cached fonts
