You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pdfbox.apache.org by "Donald Van den Driessche (JIRA)" <ji...@apache.org> on 2019/04/09 06:41:00 UTC

[jira] [Commented] (PDFBOX-4489) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)

    [ https://issues.apache.org/jira/browse/PDFBOX-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813044#comment-16813044 ] 

Donald Van den Driessche commented on PDFBOX-4489:
--------------------------------------------------

After further investigation and monitoring the memory usage during the full ManifoldCF process, the conclusion is that the memory spikes at certain times, which causes the ManifoldCF to crash.

We started downloading the files in the connector before handing them over to the TIKA-parser.

When manually downloading it from the site and running it through the pdfbox-app there is no error. Added file "xid-515432_1_good.pdf"

When running the downloaded (via the connector) file through the pdfbox-app we get the same error as Manifold throws. Added file "xid-515432_1_bad.pdf"

There was no difference using java 8 or 11.

We tried running the pdfbox app with more memory locally (16g) and java 11 and then it "processes" the file but with a different error.
Same happened on java 8 with 24g.
{code:java}
java -Xmx16g -jar ../pdfbox-app-2.0.14.jar ExtractText xid-515432_1.pdf Apr 09, 2019 8:20:37 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init> WARNING: Could not read embedded TTF for font ABCDEE+Calibri,Bold java.io.EOFException at org.apache.fontbox.ttf.MemoryTTFDataStream.readUnsignedShort(MemoryTTFDataStream.java:120) at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:120) at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98) at org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78) at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:353) at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173) at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150) at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106) at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:198) at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75) at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146) at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60) at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:869) at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:505) at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:479) at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:152) at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139) at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391) at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319) at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266) at org.apache.pdfbox.tools.ExtractText.extractPages(ExtractText.java:375) at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:272) at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:96) at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60) Apr 09, 2019 8:20:37 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init> WARNING: Using fallback font ‘Helvetica-Bold’ for ‘ABCDEE+Calibri,Bold’ Apr 09, 2019 8:20:37 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init> INFO: OpenType Layout tables used in font ABCDEE+Calibri are not implemented in PDFBox and will be ignored
{code}
 

> Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-4489
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4489
>             Project: PDFBox
>          Issue Type: Bug
>          Components: FontBox
>    Affects Versions: 2.0.14
>            Reporter: Donald Van den Driessche
>            Priority: Major
>         Attachments: xid-515432_1_bad.pdf, xid-515432_1_good.pdf
>
>
> When using the internal Tika extractor in a Manifold Job on certain occasions I get an Out of Memory Error.
> {code:java}
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: agents process ran out of memory - shutting down
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: java.lang.OutOfMemoryError: Java heap space
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:199)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$MonitoredAddActivityWrapper.sendDocument(IncrementalIngester.java:3471)
> Mar 16 14:20:06 manifold01 manifoldcf[15747]: at digital.formica.manifold.connector.transformation.fetchwebresource.WebresourceFetchTransformationConnector.addOrReplaceDocumentWithException(WebresourceFetchTransformationConnector.java:118)
> {code}
> I've allocated 8g of heap size, Installed the latest version of Tika (1.20) and PDFBOX (2.0.14).
> But no solutions found.
> After a heap dump and analyzing this dump, I notice that it is the Integer class that takes about 2.6g of memory. 
> Any suggestions?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org