Posted to dev@pdfbox.apache.org by "Mario Sangiorgio (JIRA)" <ji...@apache.org> on 2010/07/16 23:28:52 UTC

[jira] Created: (PDFBOX-778) OutOfMemory when extracting text from pdf

OutOfMemory when extracting text from pdf
-----------------------------------------

                 Key: PDFBOX-778
                 URL: https://issues.apache.org/jira/browse/PDFBOX-778
             Project: PDFBox
          Issue Type: Bug
         Environment: Mac OS X
            Reporter: Mario Sangiorgio


I have to extract text from hundreds of documents, but at a certain point I get an out of memory exception.
It seems that the memory leak is related to a single file that I attached.

Please let me know if you need more details.

This is the stacktrace of the exception:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:2734)
	at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
	at java.util.ArrayList.add(ArrayList.java:351)
	at org.apache.pdfbox.pdfparser.PDFStreamParser.parse(PDFStreamParser.java:103)
	at org.apache.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:119)
	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
	at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:207)
	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)
	at it.polimi.utils.TextStripper.getFullText(TextStripper.java:57)
	at it.polimi.utils.TextStripper.getFullText(TextStripper.java:72)
	at it.polimi.utils.TextStripper.getContent(TextStripper.java:30)
	at applications.ExtractAbstracts.convert(ExtractAbstracts.java:47)
	at applications.ExtractAbstracts.convert(ExtractAbstracts.java:36)
	at applications.ExtractAbstracts.main(ExtractAbstracts.java:17)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PDFBOX-778) OutOfMemory when extracting text from pdf

Posted by "Mario Sangiorgio (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889487#action_12889487 ] 

Mario Sangiorgio commented on PDFBOX-778:
-----------------------------------------

I was using PDFBox 1.1.0. I'm going to update it right now, and if I still have trouble I'll let you know. Thanks for the reply!

> OutOfMemory when extracting text from pdf
> -----------------------------------------
>
>                 Key: PDFBOX-778
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-778
>             Project: PDFBox
>          Issue Type: Bug
>         Environment: Mac OS X
>            Reporter: Mario Sangiorgio
>         Attachments: 92.pdf
>
>
> I have to extract text from hundreds of documents, but at a certain point I get an out of memory exception.
> It seems that the memory leak is related to a single file that I attached.
> Please let me know if you need more details.
> This is the stacktrace of the exception:
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> 	at java.util.Arrays.copyOf(Arrays.java:2734)
> 	at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
> 	at java.util.ArrayList.add(ArrayList.java:351)
> 	at org.apache.pdfbox.pdfparser.PDFStreamParser.parse(PDFStreamParser.java:103)
> 	at org.apache.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:119)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:207)
> 	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
> 	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
> 	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)
> 	at it.polimi.utils.TextStripper.getFullText(TextStripper.java:57)
> 	at it.polimi.utils.TextStripper.getFullText(TextStripper.java:72)
> 	at it.polimi.utils.TextStripper.getContent(TextStripper.java:30)
> 	at applications.ExtractAbstracts.convert(ExtractAbstracts.java:47)
> 	at applications.ExtractAbstracts.convert(ExtractAbstracts.java:36)
> 	at applications.ExtractAbstracts.main(ExtractAbstracts.java:17)


[jira] Issue Comment Edited: (PDFBOX-778) OutOfMemory when extracting text from pdf

Posted by "Mario Sangiorgio (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889487#action_12889487 ] 

Mario Sangiorgio edited comment on PDFBOX-778 at 7/17/10 6:25 AM:
------------------------------------------------------------------

I was using PDFBox 1.1.0. I'm going to update it right now, and if I still have trouble I'll let you know. Thanks for the reply!

      was (Author: mariosangiorgio):
    I was using PDFBoc 1.1.0, I'm going to update it right now and if I still have troubles I'll let you know. Thanks for the reply!
  

[jira] Commented: (PDFBOX-778) OutOfMemory when extracting text from pdf

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889486#action_12889486 ] 

Jukka Zitting commented on PDFBOX-778:
--------------------------------------

Which version of PDFBox are you using? With PDFBox 1.2.1 I'm able to extract the text of this document with only 16MB of heap.

    $ java -Xmx16m -jar pdfbox-app-1.2.1.jar ExtractText 92.pdf
    $ wc 92.txt
      513  4792 35080 92.txt
    $ java -version
    java version "1.6.0_20"
    Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
    Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)
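
When hundreds of files have to be checked, the same command could be run once per file in a separate child JVM, so that a failure on one document is reported without ending the whole batch. The following is only a sketch under assumptions: the class PerFileExtract and its helper methods are hypothetical, the jar is assumed to sit in the working directory, and a non-zero exit status is assumed to mean that ExtractText failed.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: runs the ExtractText command
// shown above once per PDF, each time in its own JVM with a small heap.
final class PerFileExtract {

    // Builds the same command line as in the comment above.
    static List<String> command(String jar, int heapMb, String pdf) {
        return Arrays.asList(
                "java", "-Xmx" + heapMb + "m", "-jar", jar, "ExtractText", pdf);
    }

    // Starts the child JVM, forwards its output, and returns its exit status.
    static int run(List<String> cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        return p.waitFor();
    }

    public static void main(String[] args) throws Exception {
        for (String pdf : args) {
            int exit = run(command("pdfbox-app-1.2.1.jar", 16, pdf));
            if (exit != 0) {
                // Assumption: a non-zero status signals a failed extraction.
                System.err.println("failed (exit " + exit + "): " + pdf);
            }
        }
    }
}
```

Starting a JVM per file is slow, but it isolates each document completely, which makes it easier to tell whether a single file or accumulated state is responsible.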



[jira] Commented: (PDFBOX-778) OutOfMemory when extracting text from pdf

Posted by "David Wright (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905493#action_12905493 ] 

David Wright commented on PDFBOX-778:
-------------------------------------

Hi Jukka

I found this thread because I also had a bad memory leak with 1.1.0.
Upgrading to 1.2.1 has improved matters a lot, but there still seems to
be a small leak.

I'm extracting text from ~1000 PDFs, ~700MB containing ~40MB of text;
some are old and poorly constructed and with 1.1.0 generated a lot of
complaints from PDFontFactory et al.  Memory use changes as follows:

added=0 before gc MB=2.88, after MB=1.88
added=1 before gc MB=13.5, after MB=9.38
added=501 before gc MB=23.0, after MB=9.75
added=551 before gc MB=25.7, after MB=15.3
added=926 before gc MB=42.6, after MB=23.6
added=1076 before gc MB=45.4, after MB=18.9
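
The message does not say how these figures were collected. One possible way to log numbers of this shape, using only the standard Runtime API, is sketched below. The class HeapProbe is hypothetical, and "added" is assumed to be the number of documents processed so far.

```java
import java.util.Locale;

// Hypothetical helper, not part of PDFBox: reports heap use before and
// after a garbage-collection request.
final class HeapProbe {
    private static final double MB = 1024.0 * 1024.0;

    // Heap currently in use, in megabytes.
    static double usedMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MB;
    }

    // Builds one log line. System.gc() is only a hint to the JVM, so the
    // "after" value is an approximation of the live heap.
    static String sample(int added) {
        double before = usedMb();
        System.gc();
        double after = usedMb();
        return String.format(Locale.ROOT,
                "added=%d before gc MB=%.2f, after MB=%.2f",
                added, before, after);
    }

    public static void main(String[] args) {
        System.out.println(sample(0));
        // Stand-in for real work: in a batch job this is where the text
        // of the next documents would be extracted.
        byte[][] junk = new byte[50][];
        for (int i = 0; i < junk.length; i++) {
            junk[i] = new byte[100000];
        }
        junk = null; // drop the references so the memory can be reclaimed
        System.out.println(sample(50));
    }
}
```

An "after" value that keeps rising over many samples suggests that objects are being retained, while a rising "before" value alone may only reflect garbage that has not been collected yet.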

Hope this info is of some use. Generally I'm delighted with PDFBox, it
'does what it says on the tin'.

Kind regards

David Wright
Technical Author
LDS Test and Measurement, Royston Herts UK SG8 5BQ
Direct Dial +44 1763 255235


