Posted to user@manifoldcf.apache.org by Kishore Kumar <ki...@live.com> on 2017/05/03 09:32:01 UTC

Fwd: ManifoldCF

Looping manifoldcf mailing list.

KK

________________________________
From: Matei Claudiu <cl...@optis.be>
Sent: Wednesday, May 3, 2017 2:57:52 PM
To: kishorejangid@live.com
Cc: Quirynen Jasper
Subject: ManifoldCF

Hi Kishore Kumar,

Thanks for developing ManifoldCF.

I have a question about it. I am trying to use the Windows Share connector together with Tika.
The problem is that after I index some files, I get the following error:

agents process ran out of memory - shutting down
java.lang.OutOfMemoryError: Java heap space
      at java.util.Arrays.copyOf(Arrays.java:3308)
      at java.util.BitSet.ensureCapacity(BitSet.java:337)
      at java.util.BitSet.expandTo(BitSet.java:352)
      at java.util.BitSet.set(BitSet.java:447)
      at de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.characters(BoilerpipeHTMLContentHandler.java:267)
      at org.apache.tika.parser.html.BoilerpipeContentHandler.characters(BoilerpipeContentHandler.java:155)
      at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
      at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
      at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
      at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
      at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
      at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:278)
      at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
      at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
      at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
      at org.ccil.cowan.tagsoup.Parser.pcdata(Parser.java:994)
      at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:482)
      at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
      at org.apache.tika.parser.code.SourceCodeParser.parse(SourceCodeParser.java:120)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
      at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
[Thread-355] INFO org.eclipse.jetty.server.ServerConnector - Stopped ServerConnector@418c5a9c{HTTP/1.1}{0.0.0.0:8345}
[Thread-355] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.w.WebAppContext@387a8303{/mcf-api-service,file:/private/var/folders/nn/w4hqd84d42j6b4g1wdpdzpwr0000gn/T/jetty-0.0.0.0-8345-mcf-api-service.war-_mcf-api-service-any-1139783112420177477.dir/webapp/,UNAVAILABLE}{/Users/claudiu/Optis/Dev/manifoldcf/apache-manifoldcf-2.6-src/dist/example/./../web/war/mcf-api-service.war}
[Thread-355] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.w.WebAppContext@69504ae9{/mcf-authority-service,file:/private/var/folders/nn/w4hqd84d42j6b4g1wdpdzpwr0000gn/T/jetty-0.0.0.0-8345-mcf-authority-service.war-_mcf-authority-service-any-4837742264173809485.dir/webapp/,UNAVAILABLE}{/Users/claudiu/Optis/Dev/manifoldcf/apache-manifoldcf-2.6-src/dist/example/./../web/war/mcf-authority-service.war}

I have already increased the Java memory to 8 GB, but this doesn’t look like a scalable solution.

I noticed that I don’t get any errors when I exclude Tika.

Do you see a solution for this?

Thank you,

Claudiu Matei

Re: ManifoldCF

Posted by Karl Wright <da...@gmail.com>.
Hi Claudiu,

First, it looks like you are running MCF as a single process. That is fine;
if you were running a multiprocess setup you'd want to be sure to increase
the memory size of all the agents processes, and not worry about any other
MCF processes.

Second, when you put Tika in the pipeline, every worker thread can
potentially be using Tika resources at the same time.  MCF uses Tika in a
streaming way.  We don't have any real control over Tika other than that.
But you can limit this by reducing the maximum number of Tika connections.
The default is 10, but for the sake of experimentation I'd try reducing it
to something lower, e.g. 2-5.  That should limit the maximum memory
consumption.
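
To illustrate why a lower connection limit caps memory use, here is a
minimal Java sketch. It is not ManifoldCF or Tika code: the class name
BoundedExtraction, the method peakConcurrency, and the simulated "parse"
are all hypothetical. It assumes only that a connection limit behaves like
a fixed number of permits, so that however many worker threads exist, no
more than that many extractions hold memory at the same moment.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not ManifoldCF or Tika code.
class BoundedExtraction {

    // Runs `documents` simulated extractions on `workerThreads` threads while
    // letting at most `maxConnections` of them run at once. Returns the
    // highest number of extractions observed running together.
    static int peakConcurrency(int workerThreads, int maxConnections, int documents) {
        Semaphore permits = new Semaphore(maxConnections);
        AtomicInteger active = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workerThreads);

        for (int i = 0; i < documents; i++) {
            pool.submit(() -> {
                try {
                    permits.acquire(); // wait for a free "connection"
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                try {
                    int now = active.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    Thread.sleep(5); // stand-in for a memory-hungry parse
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    active.decrementAndGet();
                    permits.release();
                }
            });
        }

        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("simulated extractions did not finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return peak.get();
    }

    public static void main(String[] args) {
        // 10 worker threads, but never more than 2 extractions at a time.
        System.out.println("peak concurrent extractions: " + peakConcurrency(10, 2, 40));
    }
}
```

With 10 worker threads and a limit of 2, the reported peak never exceeds 2,
so worst-case memory is roughly the limit times the cost of the most
expensive single document rather than the thread count times that cost.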

Third, if the problem *continues* even with that restriction, it's worth
trying to find which document is causing Tika to run out of memory.  The
MCF logs will be a big help here, since each line contains the thread ID.
Please bear in mind that because of the multi-threaded nature of MCF, the
document that actually consumed the memory might not be the one being
processed when the OOM is finally thrown.  Unless you reduce the max
number of Tika connections to 1, finding the exact document will be hard.
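
As a rough aid for reading the logs per thread, the sketch below groups log
lines by thread name. The log format is an assumption: it expects the
thread name in square brackets, like the "[Thread-355]" prefix on the Jetty
lines in the original report, and your MCF log layout may differ. The class
LogByThread, the method groupByThread, and the sample lines are
hypothetical and made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the bracketed thread-name format is an assumption.
class LogByThread {

    // Matches the first bracketed token on a line, e.g. "[Thread-355]".
    private static final Pattern THREAD = Pattern.compile("\\[([^\\]]+)\\]");

    // Groups lines by thread name, preserving order of first appearance.
    // Lines without a bracketed token are skipped.
    static Map<String, List<String>> groupByThread(List<String> lines) {
        Map<String, List<String>> byThread = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = THREAD.matcher(line);
            if (m.find()) {
                byThread.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(line);
            }
        }
        return byThread;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not real ManifoldCF output.
        List<String> sample = Arrays.asList(
                "[Worker-1] INFO start document A",
                "[Worker-2] INFO start document B",
                "[Worker-1] INFO finished document A",
                "[Worker-2] ERROR out of memory");
        groupByThread(sample).forEach((thread, lines) -> {
            System.out.println(thread);
            lines.forEach(l -> System.out.println("    " + l));
        });
    }
}
```

Reading one thread's lines in sequence shows which document that thread had
started but not finished when the error appeared, which narrows the list of
candidates.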

If the document that actually causes the failure can be included in a bug
report for the Tika team, that would be ideal.

Please let me know what happens.

Karl

