Posted to dev@pdfbox.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2008/08/04 19:42:45 UTC
[jira] Commented: (PDFBOX-313) OutOfMemoryError for larger PDF text extraction
[ https://issues.apache.org/jira/browse/PDFBOX-313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619610#action_12619610 ]
Jukka Zitting commented on PDFBOX-313:
--------------------------------------
[Comment on SourceForge]
Date: 2008-06-09 19:16
Sender: nobody
Logged In: NO
I'm getting the exact same exception:
java.lang.OutOfMemoryError: Java heap space
at java.util.HashMap.resize(HashMap.java:462)
at java.util.HashMap.addEntry(HashMap.java:755)
at java.util.HashMap.put(HashMap.java:385)
at org.fontbox.cmap.CMap.addMapping(CMap.java:131)
at org.fontbox.cmap.CMapParser.parse(CMapParser.java:202)
at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:510)
at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:381)
at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:345)
at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:506)
at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:219)
at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:178)
at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:339)
at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:263)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:219)
at us.fed.nmcourt.common.pdfbox.NmdLucenePDFDocument.addContent(NmdLucenePDFDocument.java:456)
I see that the problem is in FontBox. Is it an infinite loop or is there
just too much data to parse?
Please let me know where I can upload the PDF so
that you can test this out.
James
> OutOfMemoryError for larger PDF text extraction
> -----------------------------------------------
>
> Key: PDFBOX-313
> URL: https://issues.apache.org/jira/browse/PDFBOX-313
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1805929
> Originally submitted by tdonohue on 2007-10-01 13:51.
> Hello,
> I'm using PDFBox 0.7.3, which is distributed with DSpace (www.dspace.org) version 1.4.2. Currently, I'm running into OutOfMemoryError exceptions whenever I attempt text extraction from a few larger PDFs (>10MB). I've also just tried replacing PDFBox 0.7.3 with your latest nightly-build (from Oct 1), and the error still seems to be happening.
> My JVM options are currently set to:
> -Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8
> Here are a few of the problem PDFs:
> 15MB PDF:
> https://test.ideals.uiuc.edu/bitstream/2142/2050/1/tr05.pdf
> 13MB PDF:
> https://test.ideals.uiuc.edu/bitstream/2142/1936/1/RRE06.PDF
> Here's an example error stacktrace:
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> at java.util.HashMap.addEntry(HashMap.java:753)
> at java.util.HashMap.put(HashMap.java:385)
> at org.fontbox.cmap.CMap.addMapping(CMap.java:131)
> at org.fontbox.cmap.CMapParser.parse(CMapParser.java:202)
> at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:509)
> at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:380)
> at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:343)
> at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
> at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:497)
> at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:218)
> at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:177)
> at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:339)
> at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:263)
> at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:219)
> at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:152)
> at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:114)
> at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:602)
> at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:513)
> at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:461)
> at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:428)
> at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersCollection(MediaFilterManager.java:417)
> at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:359)
> Finally, here's how the DSpace API is calling PDFBox:
> PDFTextStripper pts = new PDFTextStripper();
> PDFParser parser = null;
> String extractedText = null;
> try
> {
>     parser = new PDFParser(source);
>     parser.parse();
>     extractedText = pts.getText(new PDDocument(parser.getDocument()));
> }
> finally
> {
>     try
>     {
>         parser.getDocument().close();
>     }
>     catch(Exception e)
>     {
>         log.error("Error closing temporary PDF file: " + e.getMessage(), e);
>     }
> }
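One detail in the snippet above, unrelated to the heap error itself: if the PDFParser constructor throws, parser is still null when the finally block runs, so parser.getDocument() raises a NullPointerException that is then logged as a close error. A minimal sketch of a null-safe cleanup follows. It uses a hypothetical stand-in class (FakeDocument) and a hypothetical helper (closeQuietly) rather than PDFBox types, so that it runs on its own:

```java
import java.io.Closeable;
import java.io.IOException;

public class SafeClose {
    /** Stand-in for a parsed document; hypothetical, not a PDFBox type. */
    public static class FakeDocument implements Closeable {
        public boolean closed = false;

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Closes the resource if it was ever created; returns true if close() ran cleanly. */
    public static boolean closeQuietly(Closeable resource) {
        if (resource == null) {
            return false; // nothing was opened, so there is nothing to close
        }
        try {
            resource.close();
            return true;
        } catch (IOException e) {
            System.err.println("Error closing resource: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        FakeDocument doc = null;
        try {
            doc = new FakeDocument();
            // ... extraction work would happen here ...
        } finally {
            closeQuietly(doc); // safe even if the constructor above had thrown
        }
        System.out.println("closed=" + doc.closed);
    }
}
```

The same guard could be applied to the real parser and document objects, so that a failure while opening the file is reported as such instead of as a close error.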
> [comment on SourceForge]
> Originally sent by tdonohue.
> Logged In: YES
> user_id=1320825
> Originator: YES
> I neglected to mention both of these PDFs were initially image-based and were recently OCRed using Adobe Acrobat 8 Pro. I'm not sure that would matter for PDFBox to perform text extraction, but it's another commonality between these PDFs.
> Thanks in advance for any help you can provide!
> - Tim
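Since the report relies on -Xmx1024M, it may be worth confirming that the limit is actually in effect in the JVM that runs the extraction (the trace shows it starting from MediaFilterManager.main). A minimal sketch using only the standard java.lang.Runtime API; the class name HeapReport and its methods are hypothetical and not part of PDFBox or DSpace:

```java
public class HeapReport {
    private static final long MB = 1024L * 1024L;

    /** Upper bound the JVM will try to use for the heap (roughly the -Xmx value). */
    public static long maxBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Heap currently in use: allocated heap minus the free part of it. */
    public static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** One-line summary suitable for a log statement. */
    public static String describe() {
        return "heap used=" + (usedBytes() / MB) + "MB, max=" + (maxBytes() / MB) + "MB";
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Logging describe() immediately before the getText call would show whether the configured limit applies to that process and how much of the heap is already in use before the font's CMap is parsed.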
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.