Posted to dev@manifoldcf.apache.org by "Donald Van den Driessche (JIRA)" <ji...@apache.org> on 2019/03/19 08:50:00 UTC

[jira] [Created] (CONNECTORS-1593) Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)

Donald Van den Driessche created CONNECTORS-1593:
----------------------------------------------------

             Summary: Memory issue on org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
                 Key: CONNECTORS-1593
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1593
             Project: ManifoldCF
          Issue Type: Bug
          Components: Tika extractor
    Affects Versions: ManifoldCF 2.12
            Reporter: Donald Van den Driessche


I have also created an issue with FontBox.

When using the internal Tika extractor in a ManifoldCF job, I occasionally get an out-of-memory error:
{code:java}
Mar 16 14:20:06 manifold01 manifoldcf[15747]: agents process ran out of memory - shutting down
Mar 16 14:20:06 manifold01 manifoldcf[15747]: java.lang.OutOfMemoryError: Java heap space
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.readLangSysTable(GlyphSubstitutionTable.java:147)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptTable(GlyphSubstitutionTable.java:129)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.readScriptList(GlyphSubstitutionTable.java:98)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.GlyphSubstitutionTable.read(GlyphSubstitutionTable.java:78)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:349)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:199)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:75)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:147)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:117)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.transformation.tika.TikaParser.parse(TikaParser.java:74)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.transformation.tika.TikaExtractor.addOrReplaceDocumentWithException(TikaExtractor.java:235)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddEntryPoint.addOrReplaceDocumentWithException(IncrementalIngester.java:3226)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$PipelineAddFanout.sendDocument(IncrementalIngester.java:3077)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester$MonitoredAddActivityWrapper.sendDocument(IncrementalIngester.java:3471)
Mar 16 14:20:06 manifold01 manifoldcf[15747]: at digital.formica.manifold.connector.transformation.fetchwebresource.WebresourceFetchTransformationConnector.addOrReplaceDocumentWithException(WebresourceFetchTransformationConnector.java:118)
{code}
I've allocated 8 GB of heap and installed the latest versions of Tika (1.20) and PDFBox (2.0.14), but neither change resolved the problem.
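
One way to rule out a heap setting that is not actually being applied is to print the maximum heap the JVM reports. The following is only a minimal sketch: the class name HeapCheck is hypothetical and not part of ManifoldCF, and it is only meaningful when launched with the same JVM options as the agents process.

```java
public class HeapCheck {
    /** Maximum heap the current JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        // Runtime.maxMemory() reflects the effective -Xmx of this JVM.
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap visible to this JVM: " + maxHeapMiB() + " MiB");
    }
}
```

If the printed value is far below the configured 8 GB, the option is probably not reaching the process that runs the Tika extractor.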

After taking a heap dump and analyzing it, I noticed that instances of the Integer class account for about 2.6 GB of memory.
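
As a rough sanity check on that figure, the sketch below estimates how many Integer instances 2.6 GB would correspond to. It assumes 16 bytes per boxed Integer (typical for a 64-bit HotSpot JVM with compressed object pointers) and that the reported size is the shallow size of the instances; both are assumptions, not something the dump output above confirms. The class name IntegerFootprint is hypothetical.

```java
public class IntegerFootprint {
    // Assumption: 16 bytes per java.lang.Integer (object header plus the
    // 4-byte int value, padded to 8-byte alignment) on 64-bit HotSpot.
    static final int ASSUMED_BYTES_PER_INTEGER = 16;

    /** Estimates how many instances fit in totalBytes at the given size each. */
    static long estimateInstances(long totalBytes, int bytesPerInstance) {
        if (bytesPerInstance <= 0) {
            throw new IllegalArgumentException("bytesPerInstance must be positive");
        }
        return totalBytes / bytesPerInstance;
    }

    public static void main(String[] args) {
        // "2.6g" from the heap dump, interpreted here as 2.6 GiB.
        long reportedBytes = (long) (2.6 * 1024 * 1024 * 1024);
        long count = estimateInstances(reportedBytes, ASSUMED_BYTES_PER_INTEGER);
        System.out.println("Approximate Integer instances: " + count);
    }
}
```

Under these assumptions the dump would contain on the order of 170 million Integer objects, which seems far more than parsing a single font should need.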

Any suggestions?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)