Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2009/02/06 20:16:59 UTC

[jira] Resolved: (PDFBOX-313) OutOfMemoryError for larger PDF text extraction

     [ https://issues.apache.org/jira/browse/PDFBOX-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-313.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 0.8.0-incubator

With version 741680, a suitable key is used for caching, as Daniel suggested. Finally, everything works fine.
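
For readers unfamiliar with the idea, here is a minimal sketch of keyed caching. It is not the actual patch: the class names, the ParsedCMap stand-in, and the choice of a String key are all hypothetical and only illustrate why a stable cache key avoids repeated parsing. The stack trace quoted below shows the heap being exhausted while a CMap is parsed from PDFont.parseCmap; if the same CMap is parsed again for every font that uses it, memory grows with each copy, whereas a cache with a suitable key parses each distinct CMap only once.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: none of these names come from PDFBox or FontBox.
class CMapCacheSketch {

    // Hypothetical stand-in for an expensive-to-build parsed CMap.
    public static final class ParsedCMap {
        public final String name;

        ParsedCMap(String name) {
            this.name = name;
        }
    }

    private final Map<String, ParsedCMap> cache = new HashMap<>();
    private int parseCount = 0;

    // Returns the cached instance for this key, parsing only on a cache miss.
    public ParsedCMap get(String key) {
        return cache.computeIfAbsent(key, k -> {
            parseCount++;               // counts how often "parsing" really happens
            return new ParsedCMap(k);   // placeholder for the real parse step
        });
    }

    public int getParseCount() {
        return parseCount;
    }

    public static void main(String[] args) {
        CMapCacheSketch cache = new CMapCacheSketch();
        cache.get("Identity-H");
        cache.get("Identity-H");  // cache hit, no second parse
        cache.get("Identity-V");
        System.out.println("parses=" + cache.getParseCount()); // prints "parses=2"
    }
}
```

The key must identify the CMap content itself rather than the object that happens to reference it; otherwise two fonts sharing one CMap would still produce two cache entries.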

> OutOfMemoryError for larger PDF text extraction
> -----------------------------------------------
>
>                 Key: PDFBOX-313
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-313
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Priority: Minor
>             Fix For: 0.8.0-incubator
>
>         Attachments: Fix_for_PDFBOX-313.patch
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1805929
> Originally submitted by tdonohue on 2007-10-01 13:51.
> Hello,
> I'm using PDFBox 0.7.3, which is distributed with DSpace (www.dspace.org) version 1.4.2. Currently, I'm running into OutOfMemoryError exceptions whenever I attempt text extraction from a few larger PDFs (>10MB). I've also just tried replacing PDFBox 0.7.3 with your latest nightly build (from Oct 1), and the error still seems to be happening.
> My JVM options are currently set to:
> -Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8
> Here are a few of the problem PDFs:
> 15MB PDF:
> https://test.ideals.uiuc.edu/bitstream/2142/2050/1/tr05.pdf
> 13MB PDF:
> https://test.ideals.uiuc.edu/bitstream/2142/1936/1/RRE06.PDF
> Here's an example error stacktrace:
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.util.HashMap.addEntry(HashMap.java:753)
>         at java.util.HashMap.put(HashMap.java:385)
>         at org.fontbox.cmap.CMap.addMapping(CMap.java:131)
>         at org.fontbox.cmap.CMapParser.parse(CMapParser.java:202)
>         at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:509)
>         at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:380)
>         at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:343)
>         at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
>         at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:497)
>         at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:218)
>         at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:177)
>         at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:339)
>         at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:263)
>         at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:219)
>         at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:152)
>         at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:114)
>         at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:602)
>         at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:513)
>         at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:461)
>         at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:428)
>         at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersCollection(MediaFilterManager.java:417)
>         at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:359)
> Finally, here's how the DSpace API is calling PDFBox:
>         PDFTextStripper pts = new PDFTextStripper();
>         PDFParser parser = null;
>         String extractedText = null;
>         try
>         {
>             parser = new PDFParser(source);
>             parser.parse();
>             extractedText = pts.getText(new PDDocument(parser.getDocument()));
>         }
>         finally
>         {
>             try
>             {
>                 parser.getDocument().close();
>             }
>             catch(Exception e)
>             {
>                 log.error("Error closing temporary PDF file: " + e.getMessage(), e);
>             }
>         }
> [comment on SourceForge]
> Originally sent by tdonohue.
> I neglected to mention that both of these PDFs were initially image-based and were recently OCRed using Adobe Acrobat 8 Pro. I'm not sure whether that matters for PDFBox text extraction, but it's another commonality between these PDFs.
> Thanks in advance for any help you can provide!
> - Tim
